Table of Contents

Supp Material for Luminous Ostracod Vargula tsujii and Non-Luminous Ostracod Skogsbergia sp.

The R Juypter Notebook serves as a comprehensive repository encompasing most of the scripts and figures relevant to the luminous ostracod Vargula tsujii and non-luminous ostracod Skogsbergia sp. analyses outlined in the publication. This notebook includes the following analyses: QC steps, WGCNA, Network Visualization, Network Connectivity, Differential Gene Expression and GO enrichments.

Author: Lisa Yeter Mesrop

Load libraries

In [ ]:
#load libraries 
library(WGCNA)
library(tidyverse) 
library(edgeR)
library(matrixStats)
library(DESeq2)
library(dplyr)
library(readxl)
library(data.table)
library(ggplot2)
library(ggrepel)
library(repr)
library(topGO)
library(reshape2)
library(plyr)
library(scales)
library(readxl)
library(repr)
library(RColorBrewer)
library(pheatmap)
library(flashClust)
library(ggvenn)


#always use the following WGCNA functions 
options(stringsAsFactors = FALSE)
enableWGCNAThreads()
allowWGCNAThreads(nThreads = 22)

Load data

The initial gene expression matrix consists of 63 samples and 80911 genes before quality control (QC).

In [2]:
#read in the merged counts 
merged_counts <- read.csv("July2024_updated_combined.csv", header = TRUE, sep = ",")
In [3]:
#set the first row as row names and remove it from the data frame
row.names(merged_counts) <- merged_counts[, 1]
merged_counts <- merged_counts[, -1]
In [4]:
#create the metadata 
meta <- data.frame(names = colnames(merged_counts))
In [5]:
print(meta$names)
 [1] "A.1.counts.tab"    "A.2.counts.tab"    "A.3.counts.tab"   
 [4] "AIF.1.counts.tab"  "AIF.2.counts.tab"  "AIF.3.counts.tab" 
 [7] "AII.1.counts.tab"  "AII.2.counts.tab"  "AII.3.counts.tab" 
[10] "AIII.1.counts.tab" "AIII.2.counts.tab" "AIII.3.counts.tab"
[13] "AIM.1.counts.tab"  "AIM.2.counts.tab"  "AIM.3.counts.tab" 
[16] "AIV.1.counts.tab"  "AIV.2.counts.tab"  "AIV.3.counts.tab" 
[19] "AV.1.counts.tab"   "AV.2.counts.tab"   "AV.4.counts.tab"  
[22] "B.1.counts.tab"    "B.2.counts.tab"    "B.3.counts.tab"   
[25] "C.1.counts.tab"    "C.2.counts.tab"    "C.3.counts.tab"   
[28] "E.1.counts.tab"    "E.2.counts.tab"    "E.3.counts.tab"   
[31] "F.1.counts.tab"    "F.2.counts.tab"    "F.3.counts.tab"   
[34] "G.1.counts.tab"    "H.1.counts.tab"    "H.2.counts.tab"   
[37] "M1.1.counts.tab"   "M2.1.counts.tab"   "M3.1.counts.tab"  
[40] "vtu10.counts.tab"  "vtu1.counts.tab"   "vtu2.counts.tab"  
[43] "vtu3.counts.tab"   "vtu4.counts.tab"   "vtu5.counts.tab"  
[46] "vtu7.counts.tab"   "vtu8.counts.tab"   "vtu9.counts.tab"  
[49] "Vt.1A.counts.tab"  "Vt.1B.counts.tab"  "Vt.1C.counts.tab" 
[52] "Vt.2A.counts.tab"  "Vt.2B.counts.tab"  "Vt.2C.counts.tab" 
[55] "Vt.3A.counts.tab"  "Vt.3B.counts.tab"  "Vt.3C.counts.tab" 
[58] "Vt.4A.counts.tab"  "Vt.4B.counts.tab"  "Vt.4C.counts.tab" 
[61] "Vt.5A.counts.tab"  "Vt.5B.counts.tab"  "Vt.5C.counts.tab" 
In [6]:
sample_name = c("A", "A", "A", "AIF", "AIF", "AIF", "AII", "AII", "AII", "AIII", "AIII", "AIII", "AIM", "AIM", "AIM", "AIV", "AIV", "AIV","AV", "AV", "AV", "B", "B", "B",
                "C", "C", "C", "E", "E", "E", "F", "F", "F", "G", "H", "H", "M", "M","M", "upper_lip","upper_lip","upper_lip","upper_lip","upper_lip","upper_lip","upper_lip","upper_lip",
                                  "upper_lip", "upper_lip", "eye", "gut", "upper_lip","eye", "gut", "upper_lip", "eye", "gut","upper_lip","eye", "gut", "upper_lip","eye", "gut")
In [7]:
meta$sample_name <- sample_name

QC for downstream analyses

As a preliminary quality control (QC) measure for WGCNA analysis, the overall similarity between samples and transcripts with low counts was assessed, as these counts often introduce noise in co-expression analyses. A filter was applied to the expression matrix, removing transcripts with fewer than 5 counts in more than 3 samples, given that some sample types included a minimum of 3 biological replicates. The QC analyses utilized functions from the DESeq2 package (Love et al., 2014).

In [8]:
#import count table, meta and sample_name objects into a DESeq2 object 
dds_count_table <- DESeqDataSetFromMatrix(countData = merged_counts, colData = meta, design = ~sample_name)
Warning message in DESeqDataSet(se, design = design, ignoreRank):
“some variables in design formula are characters, converting to factors”

Barplot of counts for all samples

The counts of filtered reads were examined using a bar plot.

In [9]:
#the number of prefiltered counts for each sample 
colSums(assay(dds_count_table))
A.1.counts.tab
1713711
A.2.counts.tab
1130600
A.3.counts.tab
1309974
AIF.1.counts.tab
2252469
AIF.2.counts.tab
2281888
AIF.3.counts.tab
1355952
AII.1.counts.tab
1734077
AII.2.counts.tab
1564175
AII.3.counts.tab
1828675
AIII.1.counts.tab
1669193
AIII.2.counts.tab
1392693
AIII.3.counts.tab
1422755
AIM.1.counts.tab
1496750
AIM.2.counts.tab
1392039
AIM.3.counts.tab
1698406
AIV.1.counts.tab
1136629
AIV.2.counts.tab
1347465
AIV.3.counts.tab
1196839
AV.1.counts.tab
170933
AV.2.counts.tab
125589
AV.4.counts.tab
265684
B.1.counts.tab
1384976
B.2.counts.tab
1650122
B.3.counts.tab
1546708
C.1.counts.tab
1767342
C.2.counts.tab
1349145
C.3.counts.tab
1524515
E.1.counts.tab
1821688
E.2.counts.tab
1819465
E.3.counts.tab
1671419
F.1.counts.tab
828106
F.2.counts.tab
1553937
F.3.counts.tab
1612315
G.1.counts.tab
2153351
H.1.counts.tab
1601848
H.2.counts.tab
1492178
M1.1.counts.tab
208287
M2.1.counts.tab
251441
M3.1.counts.tab
144724
vtu10.counts.tab
906794
vtu1.counts.tab
985761
vtu2.counts.tab
449438
vtu3.counts.tab
904537
vtu4.counts.tab
912254
vtu5.counts.tab
737160
vtu7.counts.tab
1501034
vtu8.counts.tab
502942
vtu9.counts.tab
930085
Vt.1A.counts.tab
1872112
Vt.1B.counts.tab
835793
Vt.1C.counts.tab
1467324
Vt.2A.counts.tab
1434424
Vt.2B.counts.tab
356946
Vt.2C.counts.tab
2046959
Vt.3A.counts.tab
1520743
Vt.3B.counts.tab
375309
Vt.3C.counts.tab
2648911
Vt.4A.counts.tab
1264138
Vt.4B.counts.tab
997891
Vt.4C.counts.tab
1733843
Vt.5A.counts.tab
1720769
Vt.5B.counts.tab
826908
Vt.5C.counts.tab
2158009
In [10]:
#function from the R package repr to visualize figures in Jupyter Notebook(optional)
options(repr.plot.width=10, repr.plot.height=10)
In [11]:
#extract sample names for barplot 
names <- dds_count_table$sample_name
In [12]:
#visualize prefiltered raw counts in a barplot  

librarySizes <- colSums(assay(dds_count_table))

par(mar=c(10,5,2,2))  
barplot(librarySizes, 
        names=names, 
        las=2, 
        cex.names=.5) +title(main = "Barplot of Count Distributions of Samples", line = -1, outer = TRUE)

Remove samples with low counts for WGCNA

Based on the bar plot count distribution in Section 3.1, the following six samples were removed due to very low counts: M1.1.counts.tab, M2.1.counts.tab, M3.1.counts.tab, AV.1.counts.tab, AV.2.counts.tab, AV.4.counts.tab.

In [4]:
#update the merged counts, meta and sample_name objects with the six samples removed 
merged_counts_subset <- subset(merged_counts, select= -c(M1.1.counts.tab, M2.1.counts.tab, M3.1.counts.tab, AV.1.counts.tab, AV.2.counts.tab, AV.4.counts.tab))
meta_subset <- data.frame(row.names = colnames(merged_counts_subset)) 
sample_name_subset = sample_name = c("A", "A", "A", "AIF", "AIF", "AIF", "AII", "AII", "AII", "AIII", "AIII", "AIII", "AIM", "AIM", "AIM", "AIV", "AIV", "AIV", "B", "B", "B",
                "C", "C", "C", "E", "E", "E", "F", "F", "F", "G", "H", "H", "upper_lip","upper_lip","upper_lip","upper_lip","upper_lip","upper_lip","upper_lip","upper_lip",
                                  "upper_lip", "upper_lip", "eye", "gut", "upper_lip","eye", "gut", "upper_lip", "eye", "gut","upper_lip","eye", "gut", "upper_lip","eye", "gut")

meta_subset$sample_name_subset <- sample_name_subset
meta_subset$names_subset <- rownames(meta_subset)
rownames(meta_subset) <- NULL
In [5]:
nrow(meta_subset)
57

Perform variance-stabilizing transformation for WGCNA

Perform variance-stabilizing transformation using DEseq2 package (Love et al., 2014). The authors of WGCNA recommend variance-stabilizing transformation as a normalization method before conducting network analyses (Langfelder and Horvath, 2008).

In [6]:
#import updated count table, meta and sample_name objects into a DESeq2 object 
dds_count_table_subset<- DESeqDataSetFromMatrix(countData = merged_counts_subset, colData = meta_subset, design = ~sample_name_subset)
Warning message in DESeqDataSet(se, design = design, ignoreRank):
“some variables in design formula are characters, converting to factors”
In [7]:
#filter the count table 
dds_merged_table_prefiltered <- dds_count_table_subset[rowSums(counts(dds_count_table_subset) >=5) >=3,]
In [8]:
nrow(dds_merged_table_prefiltered)
33479
In [9]:
#run DESeq2 function with variance-stabilizing transformation 
dds_subset <- DESeq(dds_merged_table_prefiltered, betaPrior = FALSE, parallel = TRUE)
#perform a variance-stabilizing transformation
vsd_subset <- varianceStabilizingTransformation(dds_subset)
estimating size factors

estimating dispersions

gene-wise dispersion estimates: 38 workers

mean-dispersion relationship

final dispersion estimates, fitting model and testing: 38 workers

-- replacing outliers and refitting for 7490 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)

estimating dispersions

fitting model and testing

Visualize transformed matrix with PCA

In [13]:
#perform PCA 

pcaData <- plotPCA(vsd_subset, intgroup="sample_name_subset", returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=sample_name_subset, shape=sample_name_subset)) +
  geom_point(size=5) + 
labs(color = "Sample Types")+ labs(shape = "Sample Types")+
scale_shape_manual(values = c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))+
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()+theme(
    panel.grid.major = element_line(colour = "gray97", size = 1),
    panel.grid.minor = element_line(linetype = "dotted"),
    panel.background = element_rect(fill = NA),
    legend.key = element_rect(fill = "gray100"),
    axis.line = element_line(size = 0.5, linetype = "solid"),
   panel.border = element_rect(colour = "black", fill = NA, size = 1)
  ) + theme(
  axis.title.x = element_text(size = 16),  
  axis.title.y = element_text(size = 16), 
  legend.title = element_text(size = 16), 
  legend.text = element_text(size = 12)    
)

WGCNA

After completing the prefiltering and normalization steps above, the count matrix now includes 57 samples and 33479 transcript and is ready for WGCNA. WGCNA (Weighted Gene Co-expression Network Analysis) identifies clusters (modules) of highly correlated genes by constructing a network based on pairwise correlations between gene expression profiles (Langfelder and Horvath, 2008). These modules often correspond to specific biological processes or pathways, indicating that the genes within a module may be part of the same regulatory process. By analyzing the connectivity of genes within each module, WGCNA also helps identify key drivers or hub genes, providing insights into gene regulation and the biological processes they govern. The following scripts are from the WGCNA package (Langfelder and Horvath, 2008).

Load data

In [14]:
#use the prefiltered and normalized count matrix from the DEseq2
vsd_subset_matrix <- assay(vsd_subset)
## many functions expect the matrix to be transposed
datExpr <- t(vsd_subset_matrix) 
## check rows/cols
nrow(datExpr)
ncol(datExpr)
57
33479

Pick a soft-threshold power

In [15]:
#ready to start WGCNA Analysis 

#run this to check if there are gene outliers
gsg=goodSamplesGenes(datExpr, verbose = 3)
gsg$allOK
 Flagging genes and samples with too many missing values...
  ..step 1
TRUE
In [16]:
# choose a set of soft-thresholding powers
powers = c(c(1:10), seq(from = 12, to=20, by=2))
# call the network topology analysis function
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
pickSoftThreshold: will use block size 1336.
 pickSoftThreshold: calculating connectivity for given powers...
   ..working on genes 1 through 1336 of 33479
   ..working on genes 1337 through 2672 of 33479
   ..working on genes 2673 through 4008 of 33479
   ..working on genes 4009 through 5344 of 33479
   ..working on genes 5345 through 6680 of 33479
   ..working on genes 6681 through 8016 of 33479
   ..working on genes 8017 through 9352 of 33479
   ..working on genes 9353 through 10688 of 33479
   ..working on genes 10689 through 12024 of 33479
   ..working on genes 12025 through 13360 of 33479
   ..working on genes 13361 through 14696 of 33479
   ..working on genes 14697 through 16032 of 33479
   ..working on genes 16033 through 17368 of 33479
   ..working on genes 17369 through 18704 of 33479
   ..working on genes 18705 through 20040 of 33479
   ..working on genes 20041 through 21376 of 33479
   ..working on genes 21377 through 22712 of 33479
   ..working on genes 22713 through 24048 of 33479
   ..working on genes 24049 through 25384 of 33479
   ..working on genes 25385 through 26720 of 33479
   ..working on genes 26721 through 28056 of 33479
   ..working on genes 28057 through 29392 of 33479
   ..working on genes 29393 through 30728 of 33479
   ..working on genes 30729 through 32064 of 33479
   ..working on genes 32065 through 33400 of 33479
   ..working on genes 33401 through 33479 of 33479
   Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
1      1    0.289 -1.570          0.852  6280.0  5.97e+03  11500
2      2    0.777 -1.350          0.884  1940.0  1.59e+03   5450
3      3    0.958 -1.310          0.946   818.0  5.14e+02   3510
4      4    0.891 -1.260          0.916   434.0  1.93e+02   2710
5      5    0.829 -1.170          0.927   275.0  7.89e+01   2300
6      6    0.793 -1.080          0.957   197.0  3.49e+01   2050
7      7    0.783 -1.010          0.971   153.0  1.63e+01   1880
8      8    0.786 -0.951          0.978   125.0  7.99e+00   1740
9      9    0.792 -0.916          0.975   106.0  4.09e+00   1630
10    10    0.813 -0.891          0.968    92.6  2.17e+00   1530
11    12    0.831 -0.863          0.953    73.2  6.68e-01   1360
12    14    0.864 -0.851          0.946    60.1  2.31e-01   1230
13    16    0.885 -0.845          0.945    50.6  8.54e-02   1120
14    18    0.909 -0.845          0.949    43.2  3.37e-02   1020
15    20    0.922 -0.844          0.951    37.4  1.39e-02    942
In [17]:
# plot the results:
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

Run co-expression similarity and adjacency and TOM

In [3]:
# co-expression similarity and adjacency using assigned softpower
softPower=3
adjacency = adjacency(datExpr, power = softPower)
In [4]:
# calculate the Topological Overlap Matrix (TOM)
# turn adjacency into topological overlap, i.e. translate the adjacency into 
# topological overlap matrix and calculate the corresponding dissimilarity 
TOM = TOMsimilarity(adjacency, TOMType = "signed", verbose = 5);
dissTOM = 1-TOM;
..connectivity..
..matrix multiplication (system BLAS)..
..normalization..
..done.

Cluster genes by TOM and merge modules with very similar expression profiles

In [18]:
# call the hierarchical clustering function
geneTree = hclust(as.dist(dissTOM), method = "average");
In [19]:
# plot the resulting clustering tree (dendrogram)
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
     labels = FALSE, hang = 0.04);
In [29]:
# set the min module size
minModuleSize = 30;
# module identification using dynamic tree cut:
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
                deepSplit = 3, pamRespectsDendro = FALSE,
                minClusterSize = minModuleSize);
 ..cutHeight not given, setting it to 0.981  ===>  99% of the (truncated) height range in dendro.
 ..done.
In [30]:
table(dynamicMods)
dynamicColors = labels2colors(dynamicMods)
dynamicMods
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
9812 8163 3159 2469 2021 1973 1397  958  747  467  404  389  377  350  218  175 
  17   18   19   20 
 165  125   60   50 
In [31]:
table(dynamicColors)
dynamicColors
       black         blue        brown         cyan        green  greenyellow 
        1397         8163         3159          350         2021          404 
      grey60    lightcyan   lightgreen  lightyellow      magenta midnightblue 
         165          175          125           60          747          218 
        pink       purple          red    royalblue       salmon          tan 
         958          467         1973           50          377          389 
   turquoise       yellow 
        9812         2469 
In [32]:
# plot the dendrogram and colors underneath
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05,
                    main = "Gene dendrogram and module colors")
In [33]:
# calculate eigengenes
MEList = moduleEigengenes(datExpr, colors = dynamicColors)
MEs = MEList$eigengenes
# calculate dissimilarity of module eigengenes
MEDiss = 1-cor(MEs);
# cluster module eigengenes
METree = hclust(as.dist(MEDiss), method = "average");
# plot the result
plot(METree, main = "Clustering of module eigengenes",
     xlab = "", sub = "")
MEDissThres = 0.05
# plot the cut line into the dendrogram
abline(h=MEDissThres, col = "red")
In [34]:
# call an automatic merging function
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
# merged module colors
mergedColors = merge$colors;
# eigengenes of the new merged modules 
mergedMEs = merge$newMEs;
 mergeCloseModules: Merging modules whose distance is less than 0.05
   multiSetMEs: Calculating module MEs.
     Working on set 1 ...
     moduleEigengenes: Calculating 20 module eigengenes in given set.
   Calculating new MEs...
   multiSetMEs: Calculating module MEs.
     Working on set 1 ...
     moduleEigengenes: Calculating 20 module eigengenes in given set.
In [35]:
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
                    c("Dynamic Tree Cut", "Merged dynamic"),
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
In [3]:
#note, dynamicColors and mergedColors are the same in this case
table(mergedColors)
mergedColors
       black         blue        brown         cyan        green  greenyellow 
        1397         8163         3159          350         2021          404 
      grey60    lightcyan   lightgreen  lightyellow      magenta midnightblue 
         165          175          125           60          747          218 
        pink       purple          red    royalblue       salmon          tan 
         958          467         1973           50          377          389 
   turquoise       yellow 
        9812         2469 

Identify the module that contains Vargula tsujii c-luciferase

The functionally demonstrated c-luciferase gene, along with some phylogenetically similar c-luciferase genes, are located in the pink module, which we refer to as the Bioluminescent Co-Regulatory Network (BCN).

In [4]:
#determine which column number in the datExpr object that corresponds to c-luciferase
which(colnames(datExpr) == "NODE_10321_length_1972_cov_1770.41_g3092_i1")
409
In [5]:
#determine the color of the module with c-luciferase
dynamicColors[[409]]
'pink'
In [57]:
# extract all modules as a table for downstream analyses 
SubGeneNames = colnames(datExpr)
for (color in dynamicColors){
  module=SubGeneNames[which(dynamicColors==color)]
  write.table(module, paste("module_",color, ".txt",sep=""), sep="\t", row.names=FALSE, col.names=FALSE,quote=FALSE)
}
In [39]:
# extract the pink module of interest from the dynamicColors object 
SubGeneNames = colnames(datExpr)
In [40]:
pink=as.data.frame(SubGeneNames[which(dynamicColors=="pink")])
names(pink)[1] <- "transcript_id"
In [148]:
head(pink)
A data.frame: 6 × 1
transcript_id
<chr>
1NODE_100020_length_140_cov_8.6_g92800_i0
2NODE_10035_length_2011_cov_7.68814_g7072_i0
3NODE_10049_length_2009_cov_1010.67_g4245_i1
4NODE_10218_length_1985_cov_18.5554_g7198_i0
5NODE_10321_length_1972_cov_1770.41_g3092_i1
6NODE_10331_length_1971_cov_6.85491_g7280_i0

Import the annotation for Vargula tsujii transcriptome

In [7]:
#read in the Trinotate sheet for Vargula tsujii. 
Trinotate_lym_subset <- read_excel("Trinotate_lym_subset.xlsx")

Identify other modules of interest and incorporate annotation information

In [44]:
#incorporate annotation for the pink module (the BCN)
pink_module_gene_annot_deesplit3 <- setDT(Trinotate_lym_subset, key = 'transcript_id')[J(pink)]
In [45]:
#write.csv(pink_module_gene_annot_deesplit3, file = "pink_module_gene_annot_power3_deepsplit3.csv")
In [21]:
#extract the module that is significantly correlated with the eye sample (from Section 4.8) from the dynamicColors object   
BCN_lightcyan_EYE=as.data.frame(SubGeneNames[which(dynamicColors=="lightcyan")])
names(BCN_lightcyan_EYE)[1] <- "transcript_id"
BCN_lightcyan_EYE_gene_annot <- setDT(Trinotate_lym_subset, key = 'transcript_id')[J(BCN_lightcyan_EYE)]
In [23]:
#write.csv(BCN_lightcyan_EYE_gene_annot, file = "BCN_lightcyan_EYE_gene_annot.csv")
In [27]:
#extract the module that is correlated with gut (from Section 4.8) from the dynamicColors object   
BCN_yellow_GUT=as.data.frame(SubGeneNames[which(dynamicColors=="yellow")])
names(BCN_yellow_GUT)[1] <- "transcript_id"

#match the annotation with each transcript 
BCN_yellow_GUT_gene_annot <- setDT(Trinotate_lym_subset, key = 'transcript_id')[J(BCN_yellow_GUT)]
In [28]:
#write.csv(BCN_yellow_GUT_gene_annot, file = "BCN_yellow_GUT_gene_annot.csv")

Module to trait heatmap

Identify modules (network) that are significantly associated with samples. The module eigengene provides a representative measure of the gene expression patterns within a module, allowing correlation with these traits to determine the most significant associations (Langfelder and Horvath, 2008). This correlation can then be visualized in a module-to-trait heatmap to determine the most significant associations.The following scripts are from the WGCNA package (Langfelder and Horvath, 2008).

In [120]:
traitData <- read_excel("traits_all_tsujii_updated_pink_deepsplit3.xlsx")
In [121]:
traitData <- as.data.frame(traitData)
In [122]:
names(traitData)
  1. 'Samples'
  2. 'Samples_names'
  3. 'Bio_UpperLip'
  4. 'Gut'
  5. 'Eye'
  6. 'Instar_Stage_AI'
  7. 'Instar_Stage_AII'
  8. 'Instar_Stage_AIII'
  9. 'Instar_Stage_AIV'
  10. 'Adult_Morning_Timepoint'
  11. 'Adult_forced_biolum'
In [123]:
head(traitData)
A data.frame: 6 × 11
SamplesSamples_namesBio_UpperLipGutEyeInstar_Stage_AIInstar_Stage_AIIInstar_Stage_AIIIInstar_Stage_AIVAdult_Morning_TimepointAdult_forced_biolum
<chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1A.1.counts.tab A 000000010
2A.2.counts.tab A 000000010
3A.3.counts.tab A 000000010
4AIF.1.counts.tabAIF000100000
5AIF.2.counts.tabAIF000100000
6AIF.3.counts.tabAIF000100000
In [124]:
allTraits = traitData
In [125]:
head(allTraits)
A data.frame: 6 × 11
SamplesSamples_namesBio_UpperLipGutEyeInstar_Stage_AIInstar_Stage_AIIInstar_Stage_AIIIInstar_Stage_AIVAdult_Morning_TimepointAdult_forced_biolum
<chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1A.1.counts.tab A 000000010
2A.2.counts.tab A 000000010
3A.3.counts.tab A 000000010
4AIF.1.counts.tabAIF000100000
5AIF.2.counts.tabAIF000100000
6AIF.3.counts.tabAIF000100000
In [126]:
names(allTraits)
  1. 'Samples'
  2. 'Samples_names'
  3. 'Bio_UpperLip'
  4. 'Gut'
  5. 'Eye'
  6. 'Instar_Stage_AI'
  7. 'Instar_Stage_AII'
  8. 'Instar_Stage_AIII'
  9. 'Instar_Stage_AIV'
  10. 'Adult_Morning_Timepoint'
  11. 'Adult_forced_biolum'
In [127]:
All_samples = rownames(datExpr)
In [128]:
traitRows = match(All_samples, allTraits$Samples)
In [129]:
traitRows
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 52
  20. 53
  21. 54
  22. 19
  23. 20
  24. 21
  25. 22
  26. 23
  27. 24
  28. 55
  29. 56
  30. 57
  31. 25
  32. 26
  33. 27
  34. 28
  35. 29
  36. 30
  37. 31
  38. 32
  39. 33
  40. 34
  41. 35
  42. 36
  43. 37
  44. 42
  45. 43
  46. 38
  47. 44
  48. 45
  49. 39
  50. 46
  51. 47
  52. 40
  53. 48
  54. 49
  55. 41
  56. 50
  57. 51
In [130]:
datTraits = allTraits[traitRows, -1]
In [131]:
rownames(datTraits) = allTraits[traitRows, 1]
In [132]:
# double check that row names agree
table(rownames(datTraits)==rownames(datExpr))
TRUE 
  57 
In [133]:
datTraits$Samples_names <- NULL
In [134]:
head(datTraits)
A data.frame: 6 × 9
Bio_UpperLipGutEyeInstar_Stage_AIInstar_Stage_AIIInstar_Stage_AIIIInstar_Stage_AIVAdult_Morning_TimepointAdult_forced_biolum
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
A.1.counts.tab000000010
A.2.counts.tab000000010
A.3.counts.tab000000010
AIF.1.counts.tab000100000
AIF.2.counts.tab000100000
AIF.3.counts.tab000100000
In [135]:
# sample network based on squared Euclidean distance
A=adjacency(t(datExpr),type="distance")
# this calculates the whole network connectivity
k=as.numeric(apply(A,2,sum))-1
# standardized connectivity
Z.k=scale(k)
In [ ]:
# designate samples as outlying
# if their Z.k value is below the threshold
thresholdZ.k=-5 # often -2.5
# the color vector indicates outlyingness (red)
outlierColor=ifelse(Z.k<thresholdZ.k,"red","black")
# calculate the cluster tree using flahsClust or hclust
sampleTree = flashClust(as.dist(1-A), method = "average")
# convert traits to a color representation:
# where red indicates high values
traitColors=data.frame(numbers2colors(datTraits,signed=FALSE))
dimnames(traitColors)[[2]]=paste(names(datTraits),"",sep="")
datColors=data.frame(outlierC=outlierColor,traitColors)
In [137]:
# choose a module assignment
moduleColors_Vtsujii=dynamicColors
# define numbers of genes and samples
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
# recalculate MEs with color labels
MEs0 = moduleEigengenes(datExpr,moduleColors_Vtsujii)$eigengenes
MEsFemale = orderMEs(MEs0)
modTraitCor = cor(MEsFemale, datTraits, use = "p")
modTraitP = corPvalueStudent(modTraitCor, nSamples)
#color code each association by the correlation value; display correlations and their p-values
textMatrix = paste(signif(modTraitCor, 2), "\n(",
signif(modTraitP, 1), ")", sep = "")
dim(textMatrix) = dim(modTraitCor)
par(mar = c(8, 8.5, 3, 3))
# display the correlation values within a heatmap plot
labeledHeatmap(Matrix = modTraitCor, xLabels = names(datTraits),
yLabels = names(MEsFemale), ySymbols = names(MEsFemale),
colorLabels =FALSE,colors = blueWhiteRed(50),textMatrix=textMatrix,
setStdMargins = FALSE, cex.text = 0.5, zlim = c(-1,1),
main = paste("Module-trait relationships"))

Network visualization

The Bioluminescent Co-regulatory Network (BCN) has over 900 co-expressed transcripts making it challenging to visualize the entire network topology comprehensively. To address this, we used Cytoscape to visualize the network and we subsetted the network to only include genes that are signifcantly upregulated (uniquely) in the bioluminescent upper lip from Section 9.4.

Export BCN for Cytoscape

Export the network of interest into edge and node list files for Cytoscape (Shannon et al., 2003).

In [19]:
# select module of interest
module = "pink"
# select module probes
genes = colnames(datExpr)
inModule = dynamicColors==module
modProbes = genes[inModule]
# select the corresponding Topological Overlap
modTOM = TOM[inModule, inModule]
dimnames(modTOM) = list(modProbes, modProbes) 

# export the network into edge and node list files Cytoscape 
cyt = exportNetworkToCytoscape(modTOM,
edgeFile = paste("BCN-CytoscapeInput-edges-pink-deepslit3", paste(module, collapse="-"), ".txt", sep=""),
nodeFile = paste("BCN-CytoscapeInput-nodes-pink-deepsplit3", paste(module, collapse="-"), ".txt", sep=""),
weighted = TRUE,
threshold = 0.02,
nodeNames = modProbes,
altNodeNames = modProbes,
nodeAttr = dynamicColors[inModule])
In [68]:
#for visualization of the BCN in Cytoscape, determine how many of the significantly upregulated genes expressed in the bioluminescent upper lip (from Section 9.4) are found in the BCN. 
BCN_DE_unique_Bio_UpperLip  <- pink  %>%
  filter(transcript_id %in% unique_genes_bio_upper_lip_info$transcript_id)

#add annotation 
BCN_DE_unique_Bio_UpperLip_annot <- left_join(BCN_DE_unique_Bio_UpperLip,Trinotate_lym_subset,by="transcript_id")

GO enrichment analyses for BCN

GO enrichment analyses was performed using topGO package (Alexa Rahnenfuhrer, 2024). The figures representing the GO enrichments presented here were utilized for analysis, but were not included in the main text of the publication. The GO enrichments were visualized with GO-Figure! package for publication (Reijnders and Waterhouse, 2021). GO-Figure! reference https://github.com/lmesrop/BCN_publication/tree/main/Go-Figure!.

Import GO annotations

In [8]:
#import the trinotate go sheet from the Trinotate output 
geneID2GO <- readMappings(file ="Trinotate_go_lym.txt")
geneNames <- names(geneID2GO)
In [9]:
#save the transcript ids of all the annotated genes under geneNames object 
geneNames<- as.character(Trinotate_lym_subset$transcript_id)
#save the transcript 
myInterestingGenes= as.character(pink[,1])
In [10]:
#subset the genesNames by the transcript IDs in my red module 
geneList <- factor(as.integer(geneNames %in% myInterestingGenes))
names(geneList) <- geneNames
head(geneList)
NODE_1_length_23329_cov_11.1744_g0_i0
0
NODE_2_length_22275_cov_6.9901_g1_i0
0
NODE_3_length_16360_cov_11.9819_g2_i0
0
NODE_4_length_15645_cov_338.265_g3_i0
0
NODE_5_length_15299_cov_16.4297_g0_i1
0
NODE_6_length_14792_cov_9.32463_g4_i0
0
Levels:
  1. '0'
  2. '1'

Use topGO to identify enriched biological processes in the BCN

In [11]:
#run the topGO function. 
GOdata <- new("topGOdata", ontology = "BP", allGenes = geneList,
                    annot = annFUN.gene2GO, gene2GO = geneID2GO)

results_go <- runTest(GOdata, algorithm="weight01", statistic="Fisher")
Building most specific GOs .....

	( 12645 GO terms found. )


Build GO DAG topology ..........

	( 13427 GO terms and 31130 relations. )


Annotating nodes ...............

	( 15506 genes annotated to the GO terms. )


			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 1962 nontrivial nodes
		 parameters: 
			 test statistic: fisher


	 Level 16:	2 nodes to be scored	(0 eliminated genes)


	 Level 15:	5 nodes to be scored	(0 eliminated genes)


	 Level 14:	13 nodes to be scored	(23 eliminated genes)


	 Level 13:	24 nodes to be scored	(178 eliminated genes)


	 Level 12:	53 nodes to be scored	(594 eliminated genes)


	 Level 11:	88 nodes to be scored	(1823 eliminated genes)


	 Level 10:	148 nodes to be scored	(3560 eliminated genes)


	 Level 9:	205 nodes to be scored	(5138 eliminated genes)


	 Level 8:	241 nodes to be scored	(6569 eliminated genes)


	 Level 7:	319 nodes to be scored	(8857 eliminated genes)


	 Level 6:	335 nodes to be scored	(11685 eliminated genes)


	 Level 5:	278 nodes to be scored	(13122 eliminated genes)


	 Level 4:	158 nodes to be scored	(14316 eliminated genes)


	 Level 3:	74 nodes to be scored	(14874 eliminated genes)


	 Level 2:	18 nodes to be scored	(15191 eliminated genes)


	 Level 1:	1 nodes to be scored	(15490 eliminated genes)

In [12]:
#retrieve the GO enrichment 
goEnrichment   <- GenTable(GOdata, Fisher = results_go, orderBy = "Fisher", topNodes = 500, numChar=1000)
In [13]:
#graph the GO enrichment 
goEnrichment$Fisher <- as.numeric(goEnrichment$Fisher)
goEnrichment <- goEnrichment[goEnrichment$Fisher < 0.05,]
In [15]:
#extract transcript ids that are significantly enriched in the BCN
myterms =goEnrichment$GO.ID 
mygenes = genesInTerm(GOdata, myterms)
In [8]:
#extract the transcript ids for each GO term
var=c()
for (i in 1:length(myterms))
{
   myterm <- myterms[i]
   mygenesforterm <- mygenes[myterm][[1]]
   myfactor <- mygenesforterm %in% myInterestingGenes
   mygenesforterm2 <- mygenesforterm[myfactor == TRUE]
   mygenesforterm2 <- paste(mygenesforterm2, collapse=',')
   var[i]=paste(myterm,"genes:",mygenesforterm2)
}
In [17]:
#plot the GO enrichment results 
ntop = 135
ggdata <- goEnrichment[1:ntop,]
ggdata$Term <- factor(ggdata$Term, levels = rev(ggdata$Term)) 
plot_BCN_GO_BP <- ggplot(ggdata,
  aes(x = Term, y = -log10(Fisher), size = Significant, fill = -log10(Fisher))) +
 ylim(-1,21) + 
  geom_point(shape = 21) +
  scale_size(range = c(2.5,12.5)) +
  scale_fill_continuous(low = 'royalblue', high = 'red4') +
labs(
    title = 'GO enrichment - BP - BCN')+
  theme_bw(base_size = 20) +
labs(size= "Number of Genes")+
  theme(
    legend.position = 'right',
    legend.background = element_rect(),
    plot.title = element_text(angle = 0, size = 20, face = 'bold', vjust = 1),
    plot.subtitle = element_text(angle = 0, size = 10, face = 'bold', vjust = 1),
    plot.caption = element_text(angle = 0, size = 12, face = 'bold', vjust = 1),
    axis.text.x = element_blank(),
    axis.text.y = element_text(angle = 0, size = 13, face = 'bold', vjust = 0.5),
    axis.title = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_text(size = 8, face = 'bold'),
    axis.line = element_line(colour = 'black'),

    #Legend
    legend.key = element_blank(), 
    legend.text = element_text(size = 16, face = "bold"))+


  coord_flip()
In [19]:
plot_BCN_GO_BP + labs(x = NULL)
In [18]:
#function from the R package repr to visualize figures in Jupyter Notebook(optional)
options(repr.plot.width=14, repr.plot.height=20, repr.plot.res = 500)

Use topGO to identify enriched molecular functions in BCN

In [18]:
#run the topGO function
GOdata_MF <- new("topGOdata", ontology = "MF", allGenes = geneList,
                    annot = annFUN.gene2GO, gene2GO = geneID2GO)

results_go_MF <- runTest(GOdata_MF, algorithm="weight01", statistic="Fisher")
Building most specific GOs .....

	( 3583 GO terms found. )


Build GO DAG topology ..........

	( 3618 GO terms and 4733 relations. )


Annotating nodes ...............

	( 16040 genes annotated to the GO terms. )


			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 453 nontrivial nodes
		 parameters: 
			 test statistic: fisher


	 Level 12:	3 nodes to be scored	(0 eliminated genes)


	 Level 11:	3 nodes to be scored	(0 eliminated genes)


	 Level 10:	12 nodes to be scored	(12 eliminated genes)


	 Level 9:	21 nodes to be scored	(51 eliminated genes)


	 Level 8:	28 nodes to be scored	(146 eliminated genes)


	 Level 7:	60 nodes to be scored	(3106 eliminated genes)


	 Level 6:	92 nodes to be scored	(3841 eliminated genes)


	 Level 5:	90 nodes to be scored	(6588 eliminated genes)


	 Level 4:	92 nodes to be scored	(8758 eliminated genes)


	 Level 3:	41 nodes to be scored	(13356 eliminated genes)


	 Level 2:	10 nodes to be scored	(14309 eliminated genes)


	 Level 1:	1 nodes to be scored	(15893 eliminated genes)

In [19]:
#retrieve the GO enrichment 
goEnrichment_MF   <- GenTable(GOdata_MF, Fisher = results_go_MF, orderBy = "Fisher", topNodes = 100, numChar=1000)
In [20]:
goEnrichment_MF$Fisher <- as.numeric(goEnrichment_MF$Fisher)
In [21]:
#plot GO MF enchriment plot 
ntop = 65
ggdata <- goEnrichment_MF[1:ntop,]
ggdata$Term <- factor(ggdata$Term, levels = rev(ggdata$Term)) 
GO_MF_BCN_plot <- ggplot(ggdata,
  aes(x = Term, y = -log10(Fisher), size = Significant, fill = -log10(Fisher))) +

  ylim(-1,21) +
  geom_point(shape = 21) +
  scale_size(range = c(2.5,12.5)) +
  scale_fill_continuous(low = 'royalblue', high = 'red4') +

  labs(
    title = 'GO Analysis - MF - BCN')+
   theme_bw(base_size = 20) +
labs(size= "Number of Genes")+
  theme(
    legend.position = 'right',
    legend.background = element_rect(),
    plot.title = element_text(angle = 0, size = 20, face = 'bold', vjust = 1),
    plot.subtitle = element_text(angle = 0, size = 10, face = 'bold', vjust = 1),
    plot.caption = element_text(angle = 0, size = 12, face = 'bold', vjust = 1),
    axis.text.x = element_blank(),
    axis.text.y = element_text(angle = 0, size = 13, face = 'bold', vjust = 0.5),
    axis.title = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_text(size = 8, face = 'bold'),
    axis.line = element_line(colour = 'black'),

    #Legend
    legend.key = element_blank(), 
    legend.text = element_text(size = 16, face = "bold"))+


  coord_flip()
In [24]:
GO_MF_BCN_plot + labs(x=NULL)
In [23]:
#function from the R package repr to visualize figures in Jupyter Notebook(optional)
options(repr.plot.width=18, repr.plot.height=12, repr.plot.res = 500)

BCN network connectivity

WGCNA identifies genes integral to a (module) network using a network characteristic called module membership (Langfelder and Horvath, 2008). Module membership represents connectivity of a gene with other genes within a module and is used to define centralised hub genes (Langfelder and Horvath, 2008). Module membership (MM) is a measure ranging from 0 to 1, with higher values indicating strong connectivity within a module and lower values indicating weak connectivity (Langfelder and Horvath, 2008). Genes with similar expression patterns within a module are not only correlated with the module's overall expression (MM) but also tightly interconnected with other genes in the same module (IC) (Langfelder and Horvath, 2008). In other words, genes that are strongly correlated with the overall expression pattern of their modules (high MM) are also highly connected to other genes within those modules (high IC). There is a strong correlation between intramodule connectivity and module membership (Langfelder and Horvath, 2008).

Determine connectivity between intramodular connectivity and module membership

In [60]:
#select the same power used for WGCNA analysis in Section 4.3
ADJ1=abs(cor(datExpr,use="p"))^3
Alldegrees1=intramodularConnectivity(ADJ1, dynamicColors)
head(Alldegrees1)
A data.frame: 6 × 4
kTotalkWithinkOutkDiff
<dbl><dbl><dbl><dbl>
NODE_1_length_23329_cov_11.1744_g0_i0 346.6775 22.90212323.7754-300.87329
NODE_10_length_13445_cov_1.1233_g7_i0 552.1223 255.48906296.6332 -41.14413
NODE_100_length_8069_cov_3.11705_g65_i0 527.9934 58.67613469.3172-410.64111
NODE_10000_length_2017_cov_3.60449_g7052_i03178.66632657.10639521.55992135.54650
NODE_100009_length_140_cov_19.7765_g92789_i0 153.1775 27.60152125.5760 -97.97451
NODE_10001_length_2016_cov_11.7874_g7053_i03079.63252567.48133512.15112055.33019
In [61]:
datME=moduleEigengenes(datExpr,dynamicColors)$eigengenes
signif(cor(datME, use="p"), 2)
A matrix: 20 × 20 of type dbl
MEblackMEblueMEbrownMEcyanMEgreenMEgreenyellowMEgrey60MElightcyanMElightgreenMElightyellowMEmagentaMEmidnightblueMEpinkMEpurpleMEredMEroyalblueMEsalmonMEtanMEturquoiseMEyellow
MEblack 1.0000 0.0083 0.170 0.20 0.310 0.180 0.440-0.2600-0.640 0.280 0.450-0.8500-0.560 0.2000 0.670 0.0330-0.29 0.580 0.480 0.72
MEblue 0.0083 1.0000-0.440-0.15-0.400-0.180-0.280-0.1800 0.200-0.120-0.270-0.1300-0.260-0.0940-0.100-0.1500-0.23-0.520-0.600-0.23
MEbrown 0.1700-0.4400 1.000 0.31 0.900 0.073-0.050-0.4900-0.350 0.390 0.610 0.0880 0.360 0.1800 0.420 0.2500 0.79 0.750 0.710 0.50
MEcyan 0.2000-0.1500 0.310 1.00 0.550 0.230 0.280-0.4000 0.120 0.180 0.210 0.1300 0.160 0.1700 0.260 0.2100 0.25 0.330 0.410 0.39
MEgreen 0.3100-0.4000 0.900 0.55 1.000 0.130 0.022-0.6100-0.220 0.380 0.550 0.0490 0.300 0.1500 0.530 0.2500 0.76 0.740 0.780 0.60
MEgreenyellow 0.1800-0.1800 0.073 0.23 0.130 1.000 0.190 0.0720-0.071-0.026-0.140-0.1200-0.150 0.1000-0.059 0.1600-0.15 0.280 0.390 0.16
MEgrey60 0.4400-0.2800-0.050 0.28 0.022 0.190 1.000 0.3000-0.520 0.140 0.190-0.3400-0.097 0.2300 0.260 0.0400-0.41 0.440 0.400 0.31
MElightcyan-0.2600-0.1800-0.490-0.40-0.610 0.072 0.300 1.0000-0.013-0.190-0.300-0.0057-0.240-0.0230-0.300-0.1400-0.59-0.190-0.180-0.35
MElightgreen-0.6400 0.2000-0.350 0.12-0.220-0.071-0.520-0.0130 1.000-0.310-0.570 0.5100 0.140-0.3000-0.530-0.0070 0.18-0.700-0.500-0.47
MElightyellow 0.2800-0.1200 0.390 0.18 0.380-0.026 0.140-0.1900-0.310 1.000 0.440-0.0690 0.110 0.0200 0.310 0.1200 0.22 0.400 0.320 0.27
MEmagenta 0.4500-0.2700 0.610 0.21 0.550-0.140 0.190-0.3000-0.570 0.440 1.000-0.2000 0.083-0.1500 0.750 0.1300 0.29 0.660 0.500 0.48
MEmidnightblue-0.8500-0.1300 0.088 0.13 0.049-0.120-0.340-0.0057 0.510-0.069-0.200 1.0000 0.720-0.1100-0.400 0.1500 0.48-0.260-0.150-0.41
MEpink-0.5600-0.2600 0.360 0.16 0.300-0.150-0.097-0.2400 0.140 0.110 0.083 0.7200 1.000-0.0990-0.170 0.1500 0.60 0.025 0.062-0.35
MEpurple 0.2000-0.0940 0.180 0.17 0.150 0.100 0.230-0.0230-0.300 0.020-0.150-0.1100-0.099 1.0000-0.160 0.0021-0.03 0.340 0.340 0.41
MEred 0.6700-0.1000 0.420 0.26 0.530-0.059 0.260-0.3000-0.530 0.310 0.750-0.4000-0.170-0.1600 1.000 0.1100 0.12 0.620 0.520 0.58
MEroyalblue 0.0330-0.1500 0.250 0.21 0.250 0.160 0.040-0.1400-0.007 0.120 0.130 0.1500 0.150 0.0021 0.110 1.0000 0.17 0.280 0.330 0.29
MEsalmon-0.2900-0.2300 0.790 0.25 0.760-0.150-0.410-0.5900 0.180 0.220 0.290 0.4800 0.600-0.0300 0.120 0.1700 1.00 0.270 0.310 0.12
MEtan 0.5800-0.5200 0.750 0.33 0.740 0.280 0.440-0.1900-0.700 0.400 0.660-0.2600 0.025 0.3400 0.620 0.2800 0.27 1.000 0.950 0.76
MEturquoise 0.4800-0.6000 0.710 0.41 0.780 0.390 0.400-0.1800-0.500 0.320 0.500-0.1500 0.062 0.3400 0.520 0.3300 0.31 0.950 1.000 0.74
MEyellow 0.7200-0.2300 0.500 0.39 0.600 0.160 0.310-0.3500-0.470 0.270 0.480-0.4100-0.350 0.4100 0.580 0.2900 0.12 0.760 0.740 1.00
In [62]:
datKME =signedKME(datExpr, datME, outputColumnName="MM.")
head(datKME)
A data.frame: 6 × 20
MM.blackMM.blueMM.brownMM.cyanMM.greenMM.greenyellowMM.grey60MM.lightcyanMM.lightgreenMM.lightyellowMM.magentaMM.midnightblueMM.pinkMM.purpleMM.redMM.royalblueMM.salmonMM.tanMM.turquoiseMM.yellow
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_1_length_23329_cov_11.1744_g0_i0-0.30368791-0.14969642 0.3875821 0.05762415 0.3249978-0.16801729-0.2477889-0.2437221 0.30013070-0.068409302 0.12153560 0.28455650 0.39659544-0.04627714-0.05438410 0.23865152 0.581943018 0.004672927 0.05947533-0.055411594
NODE_10_length_13445_cov_1.1233_g7_i0-0.05108110-0.41002566 0.2163960 0.30556354 0.1633261 0.50617341 0.3025467 0.1550024-0.14813914 0.051677039 0.15658071 0.14895643 0.28791707 0.08394362-0.04867757 0.14479725-0.007502304 0.330575274 0.37633456-0.001410588
NODE_100_length_8069_cov_3.11705_g65_i0 0.29791030-0.03715798 0.4437366 0.08678794 0.4869341-0.13784156-0.1670715-0.4362257-0.23442872 0.150188218 0.37950897-0.10886823 0.05457831 0.09944390 0.45220631 0.07830078 0.331509015 0.357280041 0.35134912 0.373944691
NODE_10000_length_2017_cov_3.60449_g7052_i0-0.04294892 0.97444455-0.4224506-0.20361634-0.4235453-0.14042822-0.3045532-0.1369213 0.16586033-0.087557616-0.24217527-0.07792424-0.22583888-0.09629169-0.11449203-0.15310422-0.234757522-0.502923156-0.58710040-0.259241821
NODE_100009_length_140_cov_19.7765_g92789_i0-0.16559540-0.01569032 0.3262561 0.08709405 0.1923632 0.08616494-0.1177858-0.1130039-0.03087691-0.006029759 0.03440777 0.17817544 0.16860468 0.15141624-0.07300625 0.27795238 0.341849655 0.128458318 0.11203479 0.075031356
NODE_10001_length_2016_cov_11.7874_g7053_i0 0.01187859 0.96344965-0.4208837-0.20222829-0.4171477-0.13399502-0.2709880-0.1112112 0.15194117-0.106911373-0.24338593-0.15445441-0.26406670-0.08567098-0.11270397-0.11142467-0.254735114-0.499205222-0.57808094-0.231571301
In [63]:
nrow(datKME)
33479
In [64]:
#determine the module membership vs intramodular connectivity for the pink module (the BCN)
which.color="pink" 
restrictGenes=dynamicColors==which.color
verboseScatterplot(Alldegrees1$kWithin[ restrictGenes],
(datKME[restrictGenes, paste("MM.", which.color, sep="")])^10,
col=which.color,
xlab="Intramodular Connectivity",
ylab="(Module Membership)^3")
In [119]:
#function from the R package repr to visualize figures in Jupyter Notebook(optional)
options(repr.plot.width=4, repr.plot.height=3)

Identify genes in the BCN with highest intramodular connectivity

Identify all co-expressed transcripts within the BCN (red module) that have a high module membership (MM > 0.8). An MM value of 0.8 or higher indicates a strong association with the module eigengene, suggesting that the gene plays a central role in the module's network and can be considered a hub gene (Langfelder and Horvath, 2008).

In [65]:
combined_datKME_ADJ1 = merge(datKME, Alldegrees1, by=0)
In [25]:
head(combined_datKME_ADJ1)
A data.frame: 6 × 25
Row.namesMM.blackMM.blueMM.brownMM.cyanMM.greenMM.greenyellowMM.grey60MM.lightcyanMM.lightgreenMM.redMM.royalblueMM.salmonMM.tanMM.turquoiseMM.yellowkTotalkWithinkOutkDiff
<I<chr>><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1NODE_1_length_23329_cov_11.1744_g0_i0 -0.30368791-0.14969642 0.3875821 0.05762415 0.3249978-0.16801729-0.2477889-0.2437221 0.30013070-0.05438410 0.23865152 0.581943018 0.004672927 0.05947533-0.055411594 346.6775 22.90212323.7754-300.87329
2NODE_10_length_13445_cov_1.1233_g7_i0 -0.05108110-0.41002566 0.2163960 0.30556354 0.1633261 0.50617341 0.3025467 0.1550024-0.14813914-0.04867757 0.14479725-0.007502304 0.330575274 0.37633456-0.001410588 552.1223 255.48906296.6332 -41.14413
3NODE_100_length_8069_cov_3.11705_g65_i0 0.29791030-0.03715798 0.4437366 0.08678794 0.4869341-0.13784156-0.1670715-0.4362257-0.23442872 0.45220631 0.07830078 0.331509015 0.357280041 0.35134912 0.373944691 527.9934 58.67613469.3172-410.64111
4NODE_10000_length_2017_cov_3.60449_g7052_i0 -0.04294892 0.97444455-0.4224506-0.20361634-0.4235453-0.14042822-0.3045532-0.1369213 0.16586033-0.11449203-0.15310422-0.234757522-0.502923156-0.58710040-0.2592418213178.66632657.10639521.55992135.54650
5NODE_100009_length_140_cov_19.7765_g92789_i0-0.16559540-0.01569032 0.3262561 0.08709405 0.1923632 0.08616494-0.1177858-0.1130039-0.03087691-0.07300625 0.27795238 0.341849655 0.128458318 0.11203479 0.075031356 153.1775 27.60152125.5760 -97.97451
6NODE_10001_length_2016_cov_11.7874_g7053_i0 0.01187859 0.96344965-0.4208837-0.20222829-0.4171477-0.13399502-0.2709880-0.1112112 0.15194117-0.11270397-0.11142467-0.254735114-0.499205222-0.57808094-0.2315713013079.63252567.48133512.15112055.33019
In [68]:
subset_combined_datKME_ADJ1 = dplyr::select(combined_datKME_ADJ1, c("Row.names","MM.pink", "kWithin","kTotal"))
In [69]:
head(subset_combined_datKME_ADJ1)
A data.frame: 6 × 4
Row.namesMM.pinkkWithinkTotal
<I<chr>><dbl><dbl><dbl>
1NODE_1_length_23329_cov_11.1744_g0_i0 0.39659544 22.90212 346.6775
2NODE_10_length_13445_cov_1.1233_g7_i0 0.28791707 255.48906 552.1223
3NODE_100_length_8069_cov_3.11705_g65_i0 0.05457831 58.67613 527.9934
4NODE_10000_length_2017_cov_3.60449_g7052_i0 -0.225838882657.106393178.6663
5NODE_100009_length_140_cov_19.7765_g92789_i0 0.16860468 27.60152 153.1775
6NODE_10001_length_2016_cov_11.7874_g7053_i0 -0.264066702567.481333079.6325
In [71]:
#make the transcript ids rownames again
row.names(subset_combined_datKME_ADJ1) <- subset_combined_datKME_ADJ1$Row.names
subset_combined_datKME_ADJ1[1] <- NULL
In [72]:
head(subset_combined_datKME_ADJ1)
A data.frame: 6 × 3
MM.pinkkWithinkTotal
<dbl><dbl><dbl>
NODE_1_length_23329_cov_11.1744_g0_i0 0.39659544 22.90212 346.6775
NODE_10_length_13445_cov_1.1233_g7_i0 0.28791707 255.48906 552.1223
NODE_100_length_8069_cov_3.11705_g65_i0 0.05457831 58.67613 527.9934
NODE_10000_length_2017_cov_3.60449_g7052_i0-0.225838882657.106393178.6663
NODE_100009_length_140_cov_19.7765_g92789_i0 0.16860468 27.60152 153.1775
NODE_10001_length_2016_cov_11.7874_g7053_i0-0.264066702567.481333079.6325
In [97]:
#subset the subset_combined_datKME_ADJ1 to include just transcripts from the red module (the BCN)
top_datKME_ADJ1_.8 <-subset_combined_datKME_ADJ1 %>% dplyr::filter(subset_combined_datKME_ADJ1$MM.pink >= 0.79)
In [98]:
head(top_datKME_ADJ1_.8)
A data.frame: 6 × 3
MM.pinkkWithinkTotal
<dbl><dbl><dbl>
NODE_10321_length_1972_cov_1770.41_g3092_i10.8076418197.8559674.0547
NODE_10505_length_1950_cov_293.099_g7393_i00.9096048271.2500681.6360
NODE_10519_length_1949_cov_2.1246_g7404_i00.8919787256.8381536.6456
NODE_10648_length_1930_cov_48.5136_g7489_i00.8666351236.1228737.2166
NODE_106809_length_128_cov_1412.44_g99589_i00.9107073286.8564663.0907
NODE_10748_length_1916_cov_1427.62_g3092_i40.7953738189.0439689.2682
In [99]:
#make rowname a column for downstream subsetting 
top_datKME_ADJ1_.8$transcript_id <- rownames(top_datKME_ADJ1_.8)
In [100]:
head(top_datKME_ADJ1_.8)
A data.frame: 6 × 4
MM.pinkkWithinkTotaltranscript_id
<dbl><dbl><dbl><chr>
NODE_10321_length_1972_cov_1770.41_g3092_i10.8076418197.8559674.0547NODE_10321_length_1972_cov_1770.41_g3092_i1
NODE_10505_length_1950_cov_293.099_g7393_i00.9096048271.2500681.6360NODE_10505_length_1950_cov_293.099_g7393_i0
NODE_10519_length_1949_cov_2.1246_g7404_i00.8919787256.8381536.6456NODE_10519_length_1949_cov_2.1246_g7404_i0
NODE_10648_length_1930_cov_48.5136_g7489_i00.8666351236.1228737.2166NODE_10648_length_1930_cov_48.5136_g7489_i0
NODE_106809_length_128_cov_1412.44_g99589_i00.9107073286.8564663.0907NODE_106809_length_128_cov_1412.44_g99589_i0
NODE_10748_length_1916_cov_1427.62_g3092_i40.7953738189.0439689.2682NODE_10748_length_1916_cov_1427.62_g3092_i4
In [102]:
#reorganize the column orders

top_datKME_ADJ1_.8_column_reordered <- top_datKME_ADJ1_.8[, c("transcript_id", "MM.pink", "kWithin", "kTotal")] 

head(top_datKME_ADJ1_.8_column_reordered)
A data.frame: 6 × 4
transcript_idMM.pinkkWithinkTotal
<chr><dbl><dbl><dbl>
NODE_10321_length_1972_cov_1770.41_g3092_i1NODE_10321_length_1972_cov_1770.41_g3092_i1 0.8076418197.8559674.0547
NODE_10505_length_1950_cov_293.099_g7393_i0NODE_10505_length_1950_cov_293.099_g7393_i0 0.9096048271.2500681.6360
NODE_10519_length_1949_cov_2.1246_g7404_i0NODE_10519_length_1949_cov_2.1246_g7404_i0 0.8919787256.8381536.6456
NODE_10648_length_1930_cov_48.5136_g7489_i0NODE_10648_length_1930_cov_48.5136_g7489_i0 0.8666351236.1228737.2166
NODE_106809_length_128_cov_1412.44_g99589_i0NODE_106809_length_128_cov_1412.44_g99589_i00.9107073286.8564663.0907
NODE_10748_length_1916_cov_1427.62_g3092_i4NODE_10748_length_1916_cov_1427.62_g3092_i4 0.7953738189.0439689.2682
In [103]:
# reorder spreadsheet based on MM.red column from largest to smallest 

top_datKME_ADJ1_.8_column_reordered_desc <-top_datKME_ADJ1_.8_column_reordered %>% dplyr::arrange(desc(kWithin))
In [104]:
head(top_datKME_ADJ1_.8_column_reordered_desc)
A data.frame: 6 × 4
transcript_idMM.pinkkWithinkTotal
<chr><dbl><dbl><dbl>
NODE_25135_length_745_cov_25.1174_g18671_i0NODE_25135_length_745_cov_25.1174_g18671_i00.9474210309.4572695.7833
NODE_46791_length_332_cov_515.13_g39571_i0NODE_46791_length_332_cov_515.13_g39571_i0 0.9381178308.3861756.7506
NODE_37752_length_426_cov_358.65_g30590_i0NODE_37752_length_426_cov_358.65_g30590_i0 0.9427033307.5180720.5851
NODE_26371_length_698_cov_191.986_g19772_i0NODE_26371_length_698_cov_191.986_g19772_i00.9369760307.3016710.4163
NODE_37894_length_424_cov_305.016_g30731_i0NODE_37894_length_424_cov_305.016_g30731_i00.9344361307.2833706.8262
NODE_38913_length_410_cov_355.231_g31740_i0NODE_38913_length_410_cov_355.231_g31740_i00.9383374306.8618691.1877
In [105]:
#annotate the top datKME genes 
top_datKME_ADJ1_.8_column_reordered_desc_annot_trinotate <- left_join(top_datKME_ADJ1_.8_column_reordered_desc,Trinotate_lym_subset,by="transcript_id")
In [106]:
head(top_datKME_ADJ1_.8_column_reordered_desc_annot_trinotate)
A data.frame: 6 × 19
transcript_idMM.pinkkWithinkTotal#gene_idsprot_Top_BLASTX_hitRNAMMERprot_idprot_coordssprot_Top_BLASTP_hitPfamSignalPTmHMMeggnogKegggene_ontology_blastgene_ontology_pfamtranscriptpeptide
<chr><dbl><dbl><dbl><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
1NODE_25135_length_745_cov_25.1174_g18671_i00.9474210309.4572695.7833g18671..NODE_25135_length_745_cov_25.1174_g18671_i0.p195-484[-].PF00061.23^Lipocalin^Lipocalin / cytosolic fatty-acid binding protein family^33-125^E:4e-08........
2NODE_46791_length_332_cov_515.13_g39571_i0 0.9381178308.3861756.7506g39571... . .. ........
3NODE_37752_length_426_cov_358.65_g30590_i0 0.9427033307.5180720.5851g30590... . .. ........
4NODE_26371_length_698_cov_191.986_g19772_i00.9369760307.3016710.4163g19772... . .. ........
5NODE_37894_length_424_cov_305.016_g30731_i00.9344361307.2833706.8262g30731... . .. ........
6NODE_38913_length_410_cov_355.231_g31740_i00.9383374306.8618691.1877g31740... . .. ........
In [107]:
#write.csv(top_datKME_ADJ1_.8_column_reordered_desc_annot_trinotate, file = "081224_top_datKME_ADJ1_.8_desc_BCN_pink_splitdeep3.csv")

Identify genes in the bioluminescent upper lip with highest intramodular connectivity

Use the results from the DGE analysis in Section Step 9.4 to identify the significantly upregulated genes (expressed uniquely) with the highest module membership (MM > 0.8).

In [111]:
#determine how many of the co-expressed genes in the BCN with MM > 0.8 are significantly upregulated in the bioluminescent upper lip

top_datKME_ADJ1_.8_column_reordered_desc_annot_DE_bio_upper_lip  <- top_datKME_ADJ1_.8_column_reordered_desc_annot_trinotate %>%
  inner_join(unique_genes_bio_upper_lip_info, by = "transcript_id")
In [112]:
#join 
top_datKME_ADJ1_.8_column_reordered_desc_annot_DE_bio_upper_lip_logfold <- left_join(top_datKME_ADJ1_.8_column_reordered_desc_annot_DE_bio_upper_lip ,unique_genes_bio_upper_lip_info,by="transcript_id")

QC for Differential Gene Expression

Import V.tsujii gene expression matrix which consists of three tissue types - upper lip, compound eye and gut - with five biological replicates for each tissue. Generate the sample name sheet (meta sheet) for downstream DESeq2 analyses.

In [27]:
#read in the gene expression matrix 
organ_level_vt <- read.delim("combined_vargula_tsujii.tab", header = TRUE, sep = "\t", quote = "")
In [28]:
head(organ_level_vt)
A data.frame: 6 × 17
XVt.1A.counts.tabVt.1B.counts.tabVt.1C.counts.tabVt.2A.counts.tabVt.2B.counts.tabVt.2C.counts.tabVt.3A.counts.tabVt.3B.counts.tabVt.3C.counts.tabVt.4A.counts.tabVt.4B.counts.tabVt.4C.counts.tabVt.5A.counts.tabVt.5B.counts.tabVt.5C.counts.tabX.1
<chr><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int><lgl>
1NODE_100001_length_141_cov_1.03488_g92781_i0 00 0 0 0 0 0 0 0 0 0 0 00 1NA
2NODE_10000_length_2017_cov_3.60449_g7052_i0 191 8 6 4 814015 9029 13532631NA
3NODE_100017_length_140_cov_9.63529_g92797_i0 60 2 1 0 5 5 0 9 0 1 7 20 9NA
4NODE_10001_length_2016_cov_11.7874_g7053_i0 232294820180662322040305149285NA
5NODE_100020_length_140_cov_8.6_g92800_i0 00 0 2 0 0 0 0 0 0 0 0 00 0NA
6NODE_100021_length_140_cov_8.04706_g92801_i0 00 0 0 0 1 0 0 3 0 0 2 10 0NA
In [29]:
#remove the extra column that got inserted 
row.names(organ_level_vt) <-organ_level_vt$X
organ_level_vt[1]<-NULL
organ_level_vt$X.1 <- NULL
In [33]:
#import the sample sheet
sampleinfo <- read_excel("sampleinfo_vtsujii_DGE.xlsx")
In [34]:
sampleinfo
A tibble: 15 × 2
samplesgroup
<chr><chr>
Vt.1A.counts.tabUpper_lip
Vt.1B.counts.tabEyes
Vt.1C.counts.tabGut
Vt.2A.counts.tabUpper_lip
Vt.2B.counts.tabEyes
Vt.2C.counts.tabGut
Vt.3A.counts.tabUpper_lip
Vt.3B.counts.tabEyes
Vt.3C.counts.tabGut
Vt.4A.counts.tabUpper_lip
Vt.4B.counts.tabEyes
Vt.4C.counts.tabGut
Vt.5A.counts.tabUpper_lip
Vt.5B.counts.tabEyes
Vt.5C.counts.tabGut
In [35]:
#generate the DESeq data set 
dds_count_table_organ_level <- DESeqDataSetFromMatrix(countData = organ_level_vt, colData = sampleinfo, design = ~group)
Warning message in DESeqDataSet(se, design = design, ignoreRank):
“some variables in design formula are characters, converting to factors”
In [36]:
#run DESeq2 function and normalization  
dds_vtsujii <- DESeq(dds_count_table_organ_level, betaPrior = FALSE, parallel = TRUE)
estimating size factors

estimating dispersions

gene-wise dispersion estimates: 38 workers

mean-dispersion relationship

final dispersion estimates, fitting model and testing: 38 workers

In [37]:
#perform a variance-stabilizing transformation
dds_vtsujii_vsd <- varianceStabilizingTransformation(dds_vtsujii)
In [38]:
#transpose the matrix 
sampleDists_dds_vtsujii_vsd <- dist(t(assay(dds_vtsujii_vsd)))
In [43]:
#plot the heatmap
sampleDistMatrix_dds_vtsujii_vsd <- as.matrix(sampleDists_dds_vtsujii_vsd)
rownames(sampleDistMatrix_dds_vtsujii_vsd) <- paste(colData(dds_count_table_organ_level)$group) 
colnames(sampleDistMatrix_dds_vtsujii_vsd) <- colData(dds_count_table_organ_level)$samples
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix_dds_vtsujii_vsd,
          clustering_distance_rows=sampleDists_dds_vtsujii_vsd,
         clustering_distance_cols=sampleDists_dds_vtsujii_vsd,
         col=colors)
In [42]:
options(repr.plot.width=8, repr.plot.height=6, repr.plot.res = 150)
In [44]:
pcaData <- plotPCA(dds_vtsujii_vsd, intgroup="group", returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=group, shape=group)) +
labs(color = "Tissue Types")+ labs(shape = "Tissue Types")+
  geom_point(size=5) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
   geom_point(size=5) +
theme(panel.grid.major = element_line(colour = "gray97",  size = 1), panel.grid.minor = element_line(linetype = "dotted"), panel.background = element_rect(fill = NA), 
   legend.key = element_rect(fill = "gray100")) + theme(axis.line = element_line(size = 0.5,linetype = "solid")) + 
  theme(panel.border = element_rect(colour = "black", fill=NA, size=1)) +
coord_fixed() +
scale_color_manual(values = c('#A5D6A7','#CE93D8', '#81D4FA'))+theme(
    legend.title = element_text(size = 16),
    legend.text = element_text(size = 14),  
    axis.title.x = element_text(size = 16),  
    axis.title.y = element_text(size = 16),  
    axis.text = element_text(size = 12)  
  )

Differential gene expression

Differential gene expression analysis was carried out in DESeq2 (Love et al., 2014). For Vargula tsujii, we determined differentially upregulated genes in three tissue types - bioluminescent upper lip, compound eye and gut - using five biological replicates for each tissue. DESeq2 was employed using a p-value < 0.05 and FC > 1.5 for the significance of differentially expressed genes using the Benjamini-Hochberg method to account for false discovery rate (FDR). Pairwise comparisons were done across tissue types (i.e., bioluminescent upper lip to compound eye, bioluminescent upper lip to gut, gut to compound eye). To determine tissue-specific differential upregulation, each tissue was compared to the other two (e.g., genes upregulated uniquely in the bioluminescent upper lip were determined by comparing upper lip expression to both compound eye and gut). For each pairwise comparison, the reference tissue was specified to determine the significantly upregulated genes in each tissue type (i.e., positive vs negative log2fold change). For example, in the bioluminescent upper lip vs eye comparison, if a gene has a positive log2fold change, then that gene is upregulated in the bioluminescent upper lip compared to the eye. Conversely, if a gene has a negative log2fold change then the gene is upregulated in eye compared to the upper lip. Genes that were upregulated uniquely in each tissue type were used for all downstream analyses.

In [45]:
#run DESeq2 function and normalization  
dds_vtsujii <- DESeq(dds_count_table_organ_level)
estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

DGE - Gut vs Compound Eye

In [46]:
#set Eyes as reference tissue
dds_vtsujii$group <- relevel(dds_vtsujii$group, ref= "Eyes")
In [47]:
#rerun DESeq command after reference is specified
dds_vtsujii <- DESeq(dds_vtsujii)
using pre-existing size factors

estimating dispersions

found already estimated dispersions, replacing these

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

In [49]:
#define contrasts, extract results table, and shrink the log2 fold changes

res_tableOE_unshrunken_Gut_Vs_Eye <- results(dds_vtsujii, contrast= c("group", "Gut", "Eyes") , alpha = 0.05)


res_tableOE_Gut_Vs_Eye  <- lfcShrink(dds_vtsujii, contrast= c("group", "Gut", "Eyes"), res = res_tableOE_unshrunken_Gut_Vs_Eye )
using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895

In [50]:
mcols(res_tableOE_Gut_Vs_Eye, use.names=T)
DataFrame with 6 rows and 2 columns
                       type                               description
                <character>                               <character>
baseMean       intermediate mean of normalized counts for all samples
log2FoldChange      results log2 fold change (MAP): group Gut vs Eyes
lfcSE               results         standard error: group Gut vs Eyes
stat                results         Wald statistic: group Gut vs Eyes
pvalue              results      Wald test p-value: group Gut vs Eyes
padj                results                      BH adjusted p-values
In [54]:
# set thresholds
# lfc.cutoff value of 0.58 translates to a 1.5-fold change in expression
# padj.cutoff value of 0.05 

padj.cutoff <- 0.05
lfc.cutoff <- 0.58
In [55]:
#convert to tibble 
res_tableOE_Gut_Vs_Eye_tb <- res_tableOE_Gut_Vs_Eye %>% data.frame() %>% rownames_to_column(var="gene") %>% as_tibble()
In [56]:
#determine all differentially expressed genes that meet the cutoff values (i.e., padj.cutoff of 0.05 and lfc.cutoff of 0.58, both upregulated and downregulated)
sigOE_Gut_Vs_Eye <- res_tableOE_Gut_Vs_Eye_tb %>% filter(padj < padj.cutoff & abs(log2FoldChange) > lfc.cutoff)
In [57]:
#extract all genes that are significantly upregulated in the Gut (positive log2 fold change)
sigOE_UPREGULATED_logfold_Gut_vs_Eye <- sigOE_Gut_Vs_Eye %>%  filter(padj < padj.cutoff & log2FoldChange > lfc.cutoff)
In [58]:
head(sigOE_UPREGULATED_logfold_Gut_vs_Eye)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10010_length_2015_cov_7.32245_g7057_i0 7.4816914.3290051.19867233.6464322.659066e-043.236611e-03
NODE_10015_length_2014_cov_682.243_g7060_i0 570.3455457.1526040.95577687.4040691.320736e-132.149895e-11
NODE_10040_length_2011_cov_2.41973_g7075_i0 14.8864865.9607081.25773134.6833272.822558e-066.652356e-05
NODE_10047_length_2010_cov_1.26305_g7081_i0 2.1085493.0922971.11612362.7754705.512192e-033.450172e-02
NODE_100534_length_140_cov_1.47059_g93314_i0 21.1810266.4151441.33791754.6755692.931403e-066.869105e-05
NODE_1006_length_4905_cov_1.0567_g707_i0 2.2374503.4829541.22488122.8148334.880265e-033.151585e-02
In [59]:
colnames(sigOE_UPREGULATED_logfold_Gut_vs_Eye)[1]<- "transcript_id"
In [60]:
#add annotation
sigOE_UPREGULATED_logfold_Gut_vs_Eye_annot <- left_join(sigOE_UPREGULATED_logfold_Gut_vs_Eye,Trinotate_lym_subset,by="transcript_id")
In [62]:
#extract all genes that are significantly upregulated in the Compound Eye (negative log2fold change) compared to the Gut. 
sigOE_DOWNREGULATED_logfold_Gut_vs_Eye <- sigOE_Gut_Vs_Eye %>% filter(padj < padj.cutoff & log2FoldChange < -lfc.cutoff)
In [63]:
#save these two dataframes for downstream analysis in Section 9.4

#genes that are significantly upregulated in the Gut (positive log2fold change)
head(sigOE_UPREGULATED_logfold_Gut_vs_Eye)

#genes that are significantly upregulated in the Compound Eye (negative log2fold change) compared to the Gut. 
head(sigOE_DOWNREGULATED_logfold_Gut_vs_Eye)
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10010_length_2015_cov_7.32245_g7057_i0 7.4816914.3290051.19867233.6464322.659066e-043.236611e-03
NODE_10015_length_2014_cov_682.243_g7060_i0 570.3455457.1526040.95577687.4040691.320736e-132.149895e-11
NODE_10040_length_2011_cov_2.41973_g7075_i0 14.8864865.9607081.25773134.6833272.822558e-066.652356e-05
NODE_10047_length_2010_cov_1.26305_g7081_i0 2.1085493.0922971.11612362.7754705.512192e-033.450172e-02
NODE_100534_length_140_cov_1.47059_g93314_i0 21.1810266.4151441.33791754.6755692.931403e-066.869105e-05
NODE_1006_length_4905_cov_1.0567_g707_i0 2.2374503.4829541.22488122.8148334.880265e-033.151585e-02
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10030_length_2012_cov_88.44_g145_i5 77.516326-2.0043520.7483891-2.6772667.422565e-034.264394e-02
NODE_1009_length_4897_cov_2.69228_g709_i0 18.842542-1.1065160.4161180-2.6597357.820226e-034.417345e-02
NODE_10106_length_2001_cov_20.3505_g7125_i0 32.497969-3.7863031.2661859-3.0125302.590799e-031.957288e-02
NODE_10108_length_2001_cov_9.98304_g7126_i0 28.846375-4.7706260.7722330-6.1236489.145690e-105.682196e-08
NODE_101175_length_138_cov_7.48193_g93955_i0 3.923975-3.8243711.2262699-3.0757522.099721e-031.670540e-02
NODE_10126_length_1999_cov_2.60802_g7138_i0 15.963167-5.2191880.9667613-5.3075571.111044e-073.921827e-06

DGE - Bioluminescent Upper Lip vs Compound Eye

In [64]:
#define contrasts, extract results table, and shrink the log2 fold changes

res_tableOE_unshrunken_Bio_UpperLip_Vs_Eye <- results(dds_vtsujii, contrast= c("group", "Upper_lip", "Eyes") , alpha = 0.05)


res_tableOE_Bio_UpperLip_Vs_Eye <- lfcShrink(dds_vtsujii, contrast= c("group", "Upper_lip", "Eyes"), res = res_tableOE_unshrunken_Bio_UpperLip_Vs_Eye)
using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895

In [65]:
mcols(res_tableOE_Bio_UpperLip_Vs_Eye, use.names=T)
DataFrame with 6 rows and 2 columns
                       type                                     description
                <character>                                     <character>
baseMean       intermediate       mean of normalized counts for all samples
log2FoldChange      results log2 fold change (MAP): group Upper_lip vs Eyes
lfcSE               results         standard error: group Upper_lip vs Eyes
stat                results         Wald statistic: group Upper lip vs Eyes
pvalue              results      Wald test p-value: group Upper lip vs Eyes
padj                results                            BH adjusted p-values
In [66]:
#convert to tibble 
res_tableOE_Bio_UpperLip_Vs_Eye_tb <- res_tableOE_Bio_UpperLip_Vs_Eye %>%  data.frame() %>% rownames_to_column(var="gene") %>% as_tibble()
In [67]:
#determine all differentially expressed genes that meet the cutoff values (i.e., padj.cutoff of 0.05 and lfc.cutoff of 0.58, both upregulated and downregulated)
sigOE_Bio_UpperLip_Vs_Eye <- res_tableOE_Bio_UpperLip_Vs_Eye_tb %>% filter(padj < padj.cutoff & abs(log2FoldChange) > lfc.cutoff)
In [68]:
#extract all genes that are significantly upregulated in the Bioluminescent Upper Lip (positive log2 fold change)
sigOE_UPREGULATED_logfold_Bio_UpperLip_Vs_Eye <- sigOE_Bio_UpperLip_Vs_Eye %>% filter(padj < padj.cutoff & log2FoldChange > lfc.cutoff)
In [38]:
head(sigOE_UPREGULATED_logfold_Bio_UpperLip_Vs_Eye)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10010_length_2015_cov_7.32245_g7057_i0 7.4816914.2448361.20231983.5702753.566072e-040.0130704556
NODE_10049_length_2009_cov_1010.67_g4245_i1144.9421193.6968330.81408524.5291835.921214e-060.0004848792
NODE_10075_length_2006_cov_50.2255_g7102_i0 47.5412142.1079240.68383913.0815622.059176e-030.0439239389
NODE_10110_length_2001_cov_7.02312_g7128_i0 10.2969264.4313081.02219654.3947011.109251e-050.0008282910
NODE_10314_length_1973_cov_79.1867_g7270_i0 51.5047331.7009030.51996233.2693441.077972e-030.0286489771
NODE_1031_length_4873_cov_1.60834_g724_i0 14.3213164.1184641.04671623.9491327.843501e-050.0040845539
In [69]:
colnames(sigOE_UPREGULATED_logfold_Bio_UpperLip_Vs_Eye)[1] <- "transcript_id"
In [70]:
#add annotation
sigOE_UPREGULATED_logfold_Bio_UpperLip_Vs_Eye_annot <- left_join(sigOE_UPREGULATED_logfold_Bio_UpperLip_Vs_Eye,Trinotate_lym_subset,by="transcript_id")
In [71]:
#extract all genes that are significantly upregulated in the Compound Eye (negative log2fold change) compared to the Bioluminescent Upper Lip, 
sigOE_DOWNREGULATED_logfold_Bio_UpperLip_Vs_Eye <- sigOE_Bio_UpperLip_Vs_Eye %>% filter(padj < padj.cutoff & log2FoldChange < -lfc.cutoff)
In [72]:
colnames(sigOE_DOWNREGULATED_logfold_Bio_UpperLip_Vs_Eye)[1] <- "transcript_id"
In [73]:
#add annotation 
sigOE_DOWNREGULATED_logfold_Bio_UpperLip_Vs_Eye_annot <- left_join(sigOE_DOWNREGULATED_logfold_Bio_UpperLip_Vs_Eye,Trinotate_lym_subset,by="transcript_id")
In [74]:
#save these two dataframes for downstream analysis in Section 9.4

#genes that are significantly upregulated in Bioluminescent Upper Lip (positive log2fold change) 
head(sigOE_UPREGULATED_logfold_Bio_UpperLip_Vs_Eye)

#genes that are significantly upregulated in the Compound Eye (negative log2fold change) compared to the Bioluminescent Upper Lip. 
head(sigOE_DOWNREGULATED_logfold_Bio_UpperLip_Vs_Eye)
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10010_length_2015_cov_7.32245_g7057_i0 7.4816914.2448361.20231983.5702753.566072e-040.0130704556
NODE_10049_length_2009_cov_1010.67_g4245_i1144.9421193.6968330.81408524.5291835.921214e-060.0004848792
NODE_10075_length_2006_cov_50.2255_g7102_i0 47.5412142.1079240.68383913.0815622.059176e-030.0439239389
NODE_10110_length_2001_cov_7.02312_g7128_i0 10.2969264.4313081.02219654.3947011.109251e-050.0008282910
NODE_10314_length_1973_cov_79.1867_g7270_i0 51.5047331.7009030.51996233.2693441.077972e-030.0286489771
NODE_1031_length_4873_cov_1.60834_g724_i0 14.3213164.1184641.04671623.9491327.843501e-050.0040845539
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10126_length_1999_cov_2.60802_g7138_i0 15.963167-3.4422930.9226241 -3.7048882.114840e-048.941592e-03
NODE_101633_length_138_cov_2.06024_g94413_i0 7.667808-6.1495541.0781402 -5.5163643.460855e-084.918433e-06
NODE_10218_length_1985_cov_18.5554_g7198_i0 61.993605-6.0859230.6771456 -8.8415809.437437e-193.922413e-16
NODE_1021_length_4884_cov_10.72_g718_i0 45.575618-6.5144800.8450182 -7.5173235.590900e-141.734596e-11
NODE_10232_length_1983_cov_42.0814_g7210_i0 434.539937-9.2868110.7189709-12.6132621.784501e-366.551496e-33
NODE_102471_length_136_cov_2.41975_g95251_i0 3.353358-4.7867481.3410759 -3.5717663.545819e-041.301788e-02

DGE - Bioluminescent Upper Lip vs Gut

In [77]:
#now set gut as reference
dds_vtsujii$group <- relevel(dds_vtsujii$group, ref= "Gut")
In [78]:
#rerun DESeq after setting a new reference
dds_vtsujii <- DESeq(dds_vtsujii)
using pre-existing size factors

estimating dispersions

found already estimated dispersions, replacing these

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

In [79]:
#Define contrasts, extract results table, and shrink the log2 fold changes

res_tableOE_unshrunken_Bio_Upper_Lip_Vs_Gut <- results(dds_vtsujii, contrast= c("group", "Upper_lip", "Gut") , alpha = 0.05)


res_tableOE_Bio_Upper_Lip_Vs_Gut  <- lfcShrink(dds_vtsujii, contrast= c("group", "Upper_lip", "Gut"), res = res_tableOE_unshrunken_Bio_Upper_Lip_Vs_Gut )
using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895

In [80]:
mcols(res_tableOE_Bio_Upper_Lip_Vs_Gut, use.names=T)
DataFrame with 6 rows and 2 columns
                       type                                    description
                <character>                                    <character>
baseMean       intermediate      mean of normalized counts for all samples
log2FoldChange      results log2 fold change (MAP): group Upper_lip vs Gut
lfcSE               results         standard error: group Upper_lip vs Gut
stat                results         Wald statistic: group Upper lip vs Gut
pvalue              results      Wald test p-value: group Upper lip vs Gut
padj                results                           BH adjusted p-values
In [84]:
#convert to tibble
res_tableOE_Bio_Upper_Lip_Vs_Gut_tb <- res_tableOE_Bio_Upper_Lip_Vs_Gut %>% data.frame() %>% rownames_to_column(var="gene") %>% as_tibble()
In [85]:
#determine all differentially expressed genes that meet the cutoff values (i.e., padj.cutoff of 0.05 and lfc.cutoff of 0.58, both upregulated and downregulated)
sigOE_Bio_Upper_Lip_Vs_Gut <- res_tableOE_Bio_Upper_Lip_Vs_Gut_tb %>% filter(padj < padj.cutoff & abs(log2FoldChange) > lfc.cutoff)
In [86]:
#extract all genes that are significantly upregulated in the Bioluminescent Upper Lip (positive log2 fold change)
sigOE_UPREGULATED_logfold_Bio_Upper_Lip_Vs_Gut <- sigOE_Bio_Upper_Lip_Vs_Gut %>% filter(padj < padj.cutoff & log2FoldChange > lfc.cutoff)
In [87]:
colnames(sigOE_UPREGULATED_logfold_Bio_Upper_Lip_Vs_Gut)[1] <- "transcript_id"
In [88]:
#add annotation 
sigOE_UPREGULATED_logfold_Bio_Upper_Lip_Vs_Gut_annot <- left_join(sigOE_UPREGULATED_logfold_Bio_Upper_Lip_Vs_Gut,Trinotate_lym_subset,by="transcript_id")
In [89]:
#extract all genes that are significantly upregulated in the Gut (negative log2fold change) compared to the Bioluminescent Upper Lip. 
sigOE_DOWNREGULATED_logfold_Bio_Upper_Lip_Vs_Gut <- sigOE_Bio_Upper_Lip_Vs_Gut %>% filter(padj < padj.cutoff & log2FoldChange < -lfc.cutoff)
In [90]:
colnames(sigOE_DOWNREGULATED_logfold_Bio_Upper_Lip_Vs_Gut)[1]<- "transcript_id"
In [91]:
#add annotation 
sigOE_DOWNREGULATED_logfold_Bio_Upper_Lip_Vs_Gut_annot <- left_join(sigOE_DOWNREGULATED_logfold_Bio_Upper_Lip_Vs_Gut, Trinotate_lym_subset,by="transcript_id")
In [129]:
#save these two dataframes for downstream analysis in Section 9.3

#genes that are significantly upregulated in the bioluminescent upper lip (positive log2 fold change)
head(sigOE_UPREGULATED_logfold_Bio_Upper_Lip_Vs_Gut)

#genes that are significantly upregulated in the gut (negative log2fold change) compared to the Bioluminescent Upper Lip. 
head(sigOE_DOWNREGULATED_logfold_Bio_Upper_Lip_Vs_Gut)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10035_length_2011_cov_7.68814_g7072_i0 31.4974643.5491891.11161323.1856211.444434e-031.694083e-02
NODE_10049_length_2009_cov_1010.67_g4245_i1 144.9421193.7802060.80275684.7029332.564508e-066.123575e-05
NODE_10092_length_2003_cov_126.275_g7115_i0 87.9102264.2261310.90710644.6530013.271381e-067.653491e-05
NODE_10106_length_2001_cov_20.3505_g7125_i0 32.4979694.0061901.26287373.1902461.421519e-031.671594e-02
NODE_10108_length_2001_cov_9.98304_g7126_i0 28.8463754.1983820.76851695.4217095.903195e-081.951548e-06
NODE_101175_length_138_cov_7.48193_g93955_i0 3.9239754.3389581.19206673.5540693.793190e-045.366714e-03
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10015_length_2014_cov_682.243_g7060_i0 570.345545-6.4174320.9300257-6.8644086.676754e-124.486477e-10
NODE_10040_length_2011_cov_2.41973_g7075_i0 14.886486-3.4248021.0781429-3.1345561.721142e-031.958089e-02
NODE_100534_length_140_cov_1.47059_g93314_i0 21.181026-5.4347731.2391311-4.2437082.198567e-054.265253e-04
NODE_10082_length_2005_cov_3.57487_g6803_i1 13.210637-2.1159220.6939908-3.0429012.343094e-032.522919e-02
NODE_10087_length_2004_cov_52.6834_g7112_i0 1.431373-4.0181871.3207673-3.0097002.615061e-032.759244e-02
NODE_10088_length_2004_cov_27.3961_g7113_i0 61.316935-7.1299261.0023331-6.8501827.375626e-124.919102e-10

Determine tissue-specific expression

To identify tissue-specific differential expression (i.e., significantly upregulated genes that are uniquely expressed), each tissue was compared to the other two and tissue-specific genes were extracted from the intersection of the Venn diagram (in Section 9.4.4). Genes that were significantly upregulated uniquely in each tissue type were used for all downstream analyses in the publication.

Bioluminescent upper lip

In [93]:
#combine genes upregulated in the Bioluminescent Upper Lip from pairwise comparisons - Bio Upper Lip vs Gut and Bio Upper Lip vs Eye - into a single dataframe. 


head(sigOE_UPREGULATED_logfold_Bio_Upper_Lip_Vs_Gut) 

head(sigOE_UPREGULATED_logfold_Bio_UpperLip_Vs_Eye)
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10035_length_2011_cov_7.68814_g7072_i0 31.4974643.5491891.11161323.1856211.444434e-031.694083e-02
NODE_10049_length_2009_cov_1010.67_g4245_i1 144.9421193.7802060.80275684.7029332.564508e-066.123575e-05
NODE_10092_length_2003_cov_126.275_g7115_i0 87.9102264.2261310.90710644.6530013.271381e-067.653491e-05
NODE_10106_length_2001_cov_20.3505_g7125_i0 32.4979694.0061901.26287373.1902461.421519e-031.671594e-02
NODE_10108_length_2001_cov_9.98304_g7126_i0 28.8463754.1983820.76851695.4217095.903195e-081.951548e-06
NODE_101175_length_138_cov_7.48193_g93955_i0 3.9239754.3389581.19206673.5540693.793190e-045.366714e-03
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10010_length_2015_cov_7.32245_g7057_i0 7.4816914.2448361.20231983.5702753.566072e-040.0130704556
NODE_10049_length_2009_cov_1010.67_g4245_i1144.9421193.6968330.81408524.5291835.921214e-060.0004848792
NODE_10075_length_2006_cov_50.2255_g7102_i0 47.5412142.1079240.68383913.0815622.059176e-030.0439239389
NODE_10110_length_2001_cov_7.02312_g7128_i0 10.2969264.4313081.02219654.3947011.109251e-050.0008282910
NODE_10314_length_1973_cov_79.1867_g7270_i0 51.5047331.7009030.51996233.2693441.077972e-030.0286489771
NODE_1031_length_4873_cov_1.60834_g724_i0 14.3213164.1184641.04671623.9491327.843501e-050.0040845539
In [94]:
colnames(sigOE_UPREGULATED_logfold_Bio_Upper_Lip_Vs_Gut)[1] <- "gene"
In [95]:
colnames(sigOE_UPREGULATED_logfold_Bio_UpperLip_Vs_Eye)[1] <- "gene"
In [96]:
#merge
merged_upper_lips_df <- rbind(
  sigOE_UPREGULATED_logfold_Bio_Upper_Lip_Vs_Gut,
  sigOE_UPREGULATED_logfold_Bio_UpperLip_Vs_Eye
)
In [97]:
head(merged_upper_lips_df)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10035_length_2011_cov_7.68814_g7072_i0 31.4974643.5491891.11161323.1856211.444434e-031.694083e-02
NODE_10049_length_2009_cov_1010.67_g4245_i1 144.9421193.7802060.80275684.7029332.564508e-066.123575e-05
NODE_10092_length_2003_cov_126.275_g7115_i0 87.9102264.2261310.90710644.6530013.271381e-067.653491e-05
NODE_10106_length_2001_cov_20.3505_g7125_i0 32.4979694.0061901.26287373.1902461.421519e-031.671594e-02
NODE_10108_length_2001_cov_9.98304_g7126_i0 28.8463754.1983820.76851695.4217095.903195e-081.951548e-06
NODE_101175_length_138_cov_7.48193_g93955_i0 3.9239754.3389581.19206673.5540693.793190e-045.366714e-03
In [98]:
#remove gene duplicates while retaining one duplicate due to pairwise comparisons. The same gene can be found in both pairwise comparisons - Bio Upper Lip vs Gut and Bio Upper Lip vs Eye.
merged_upper_lips_unique <- merged_upper_lips_df[!duplicated(merged_upper_lips_df$gene), ]
In [99]:
head(merged_upper_lips_unique)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10035_length_2011_cov_7.68814_g7072_i0 31.4974643.5491891.11161323.1856211.444434e-031.694083e-02
NODE_10049_length_2009_cov_1010.67_g4245_i1 144.9421193.7802060.80275684.7029332.564508e-066.123575e-05
NODE_10092_length_2003_cov_126.275_g7115_i0 87.9102264.2261310.90710644.6530013.271381e-067.653491e-05
NODE_10106_length_2001_cov_20.3505_g7125_i0 32.4979694.0061901.26287373.1902461.421519e-031.671594e-02
NODE_10108_length_2001_cov_9.98304_g7126_i0 28.8463754.1983820.76851695.4217095.903195e-081.951548e-06
NODE_101175_length_138_cov_7.48193_g93955_i0 3.9239754.3389581.19206673.5540693.793190e-045.366714e-03
In [158]:
nrow(merged_upper_lips_unique)
1790
In [159]:
#write.csv(merged_upper_lips_unique, file = "Vtsujii_merged_bio_upper_lip_ALL_DE_genes.csv")

Compound eye

In [101]:
#combine genes upregulated in the compound eyes from the pairwise comparisons - Bio Upper Lip vs Eye and Gut vs Eye. 


head(sigOE_DOWNREGULATED_logfold_Bio_UpperLip_Vs_Eye)

head(sigOE_DOWNREGULATED_logfold_Gut_vs_Eye)
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10126_length_1999_cov_2.60802_g7138_i0 15.963167-3.4422930.9226241 -3.7048882.114840e-048.941592e-03
NODE_101633_length_138_cov_2.06024_g94413_i0 7.667808-6.1495541.0781402 -5.5163643.460855e-084.918433e-06
NODE_10218_length_1985_cov_18.5554_g7198_i0 61.993605-6.0859230.6771456 -8.8415809.437437e-193.922413e-16
NODE_1021_length_4884_cov_10.72_g718_i0 45.575618-6.5144800.8450182 -7.5173235.590900e-141.734596e-11
NODE_10232_length_1983_cov_42.0814_g7210_i0 434.539937-9.2868110.7189709-12.6132621.784501e-366.551496e-33
NODE_102471_length_136_cov_2.41975_g95251_i0 3.353358-4.7867481.3410759 -3.5717663.545819e-041.301788e-02
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10030_length_2012_cov_88.44_g145_i5 77.516326-2.0043520.7483891-2.6772667.422565e-034.264394e-02
NODE_1009_length_4897_cov_2.69228_g709_i0 18.842542-1.1065160.4161180-2.6597357.820226e-034.417345e-02
NODE_10106_length_2001_cov_20.3505_g7125_i0 32.497969-3.7863031.2661859-3.0125302.590799e-031.957288e-02
NODE_10108_length_2001_cov_9.98304_g7126_i0 28.846375-4.7706260.7722330-6.1236489.145690e-105.682196e-08
NODE_101175_length_138_cov_7.48193_g93955_i0 3.923975-3.8243711.2262699-3.0757522.099721e-031.670540e-02
NODE_10126_length_1999_cov_2.60802_g7138_i0 15.963167-5.2191880.9667613-5.3075571.111044e-073.921827e-06
In [102]:
colnames(sigOE_DOWNREGULATED_logfold_Bio_UpperLip_Vs_Eye)[1]<- "gene"
In [104]:
merged_Eye_df <- rbind(sigOE_DOWNREGULATED_logfold_Bio_UpperLip_Vs_Eye , sigOE_DOWNREGULATED_logfold_Gut_vs_Eye)
In [105]:
merged_Eye_unique <-merged_Eye_df[!duplicated(merged_Eye_df$gene), ]
In [106]:
head(merged_Eye_unique)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10126_length_1999_cov_2.60802_g7138_i0 15.963167-3.4422930.9226241 -3.7048882.114840e-048.941592e-03
NODE_101633_length_138_cov_2.06024_g94413_i0 7.667808-6.1495541.0781402 -5.5163643.460855e-084.918433e-06
NODE_10218_length_1985_cov_18.5554_g7198_i0 61.993605-6.0859230.6771456 -8.8415809.437437e-193.922413e-16
NODE_1021_length_4884_cov_10.72_g718_i0 45.575618-6.5144800.8450182 -7.5173235.590900e-141.734596e-11
NODE_10232_length_1983_cov_42.0814_g7210_i0 434.539937-9.2868110.7189709-12.6132621.784501e-366.551496e-33
NODE_102471_length_136_cov_2.41975_g95251_i0 3.353358-4.7867481.3410759 -3.5717663.545819e-041.301788e-02
In [160]:
#write.csv(merged_Eye_unique, file = "Vtsujii_merged_Eye_ALL_DE_genes.csv")

Gut

In [107]:
#combine genes upregulated in the gut from pairwise comparisons - Gut vs Eye and Bio Upper Lip vs Gut. 

head(sigOE_UPREGULATED_logfold_Gut_vs_Eye)

head(sigOE_DOWNREGULATED_logfold_Bio_Upper_Lip_Vs_Gut)
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10010_length_2015_cov_7.32245_g7057_i0 7.4816914.3290051.19867233.6464322.659066e-043.236611e-03
NODE_10015_length_2014_cov_682.243_g7060_i0 570.3455457.1526040.95577687.4040691.320736e-132.149895e-11
NODE_10040_length_2011_cov_2.41973_g7075_i0 14.8864865.9607081.25773134.6833272.822558e-066.652356e-05
NODE_10047_length_2010_cov_1.26305_g7081_i0 2.1085493.0922971.11612362.7754705.512192e-033.450172e-02
NODE_100534_length_140_cov_1.47059_g93314_i0 21.1810266.4151441.33791754.6755692.931403e-066.869105e-05
NODE_1006_length_4905_cov_1.0567_g707_i0 2.2374503.4829541.22488122.8148334.880265e-033.151585e-02
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10015_length_2014_cov_682.243_g7060_i0 570.345545-6.4174320.9300257-6.8644086.676754e-124.486477e-10
NODE_10040_length_2011_cov_2.41973_g7075_i0 14.886486-3.4248021.0781429-3.1345561.721142e-031.958089e-02
NODE_100534_length_140_cov_1.47059_g93314_i0 21.181026-5.4347731.2391311-4.2437082.198567e-054.265253e-04
NODE_10082_length_2005_cov_3.57487_g6803_i1 13.210637-2.1159220.6939908-3.0429012.343094e-032.522919e-02
NODE_10087_length_2004_cov_52.6834_g7112_i0 1.431373-4.0181871.3207673-3.0097002.615061e-032.759244e-02
NODE_10088_length_2004_cov_27.3961_g7113_i0 61.316935-7.1299261.0023331-6.8501827.375626e-124.919102e-10
In [108]:
colnames(sigOE_UPREGULATED_logfold_Gut_vs_Eye)[1] <- "gene"
In [109]:
colnames(sigOE_DOWNREGULATED_logfold_Bio_Upper_Lip_Vs_Gut)[1]<- "gene"
In [110]:
merged_Gut_df <- rbind(sigOE_UPREGULATED_logfold_Gut_vs_Eye , sigOE_DOWNREGULATED_logfold_Bio_Upper_Lip_Vs_Gut)


merged_Gut_unique <-merged_Gut_df[!duplicated(merged_Gut_df$gene), ]
In [111]:
head(merged_Gut_unique)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10010_length_2015_cov_7.32245_g7057_i0 7.4816914.3290051.19867233.6464322.659066e-043.236611e-03
NODE_10015_length_2014_cov_682.243_g7060_i0 570.3455457.1526040.95577687.4040691.320736e-132.149895e-11
NODE_10040_length_2011_cov_2.41973_g7075_i0 14.8864865.9607081.25773134.6833272.822558e-066.652356e-05
NODE_10047_length_2010_cov_1.26305_g7081_i0 2.1085493.0922971.11612362.7754705.512192e-033.450172e-02
NODE_100534_length_140_cov_1.47059_g93314_i0 21.1810266.4151441.33791754.6755692.931403e-066.869105e-05
NODE_1006_length_4905_cov_1.0567_g707_i0 2.2374503.4829541.22488122.8148334.880265e-033.151585e-02
In [162]:
#write.csv(merged_Gut_unique, file = "Vtsujii_merged_Gut_ALL_DE_genes.csv")

Venn Diagram - Extract tissue-specific genes

In [113]:
#generate a venn diagram to visualize shared significantly upregulated genes across tissue types and extract the genes that are unique to each tissue type. 
unique_venn_list <- list(
  Bio_Upper_Lip = merged_upper_lips_unique$gene  , 
  Gut = merged_Gut_unique$gene,
  Compound_Eye = merged_Eye_unique$gene
)

ggvenn_unique <- ggvenn(
  unique_venn_list, 
  fill_color = c('#81D4FA','#A5D6A7', '#CE93D8'),
  stroke_size = .7, set_name_size = 6, text_size = 5
)

ggvenn_unique
In [3]:
# Open a PDF device
pdf("Luminous_DGE.pdf", width = 8, height = 6)


ggvenn_unique 

dev.off()
png: 2
In [114]:
#prep dataframes for extraction
Bio_Upper_Lip <- as.data.frame(merged_upper_lips_unique$gene)
colnames(Bio_Upper_Lip)[1]<- "gene"
Gut <- as.data.frame(merged_Gut_unique$gene)
colnames(Gut)[1]<- "gene"
Compound_eye <- as.data.frame(merged_Eye_unique$gene)
colnames(Compound_eye)[1]<- "gene"
In [115]:
# compare and extract unique genes for each tissue type
unique_genes_bio_upper_lip <- anti_join(Bio_Upper_Lip, Gut, by = "gene") %>%
  anti_join(Compound_eye, by = "gene")

unique_genes_gut <- anti_join(Gut, Bio_Upper_Lip, by = "gene") %>%
  anti_join(Compound_eye, by = "gene")

unique_genes_compound_eye <- anti_join(Compound_eye, Bio_Upper_Lip, by = "gene") %>%
  anti_join(Gut, by = "gene")
In [116]:
nrow(unique_genes_bio_upper_lip)
595
In [117]:
nrow(unique_genes_gut)
2534
In [118]:
nrow(unique_genes_compound_eye)
1144
In [119]:
#add the annotations back to the unique genes in each tissue type by subsetting
In [120]:
unique_genes_bio_upper_lip_info  <- merged_upper_lips_unique %>%
  filter(gene %in% unique_genes_bio_upper_lip$gene)
In [122]:
unique_genes_eye_info  <- merged_Eye_unique %>%
  filter(gene %in% unique_genes_compound_eye$gene)
In [124]:
unique_genes_gut_info  <- merged_Gut_unique %>%
  filter(gene %in% unique_genes_gut$gene)
In [126]:
#add annotations back 
colnames(unique_genes_bio_upper_lip_info)[1]<- "transcript_id"
unique_genes_bio_upper_lip_info_annot <- left_join(unique_genes_bio_upper_lip_info,Trinotate_lym_subset,by="transcript_id")
In [127]:
#add annotations back 
colnames(unique_genes_eye_info)[1]<- "transcript_id"
unique_genes_eye_info_annot <- left_join(unique_genes_eye_info,Trinotate_lym_subset,by="transcript_id")
In [128]:
#add annotations backs
colnames(unique_genes_gut_info)[1] <- "transcript_id"
unique_genes_gut_info_annot <- left_join(unique_genes_gut_info,Trinotate_lym_subset,by="transcript_id")
In [107]:
#write.csv(unique_genes_bio_upper_lip_info_annot, file = "df_Vtsujii_sigfig_upreg_unique_BioUpperLip.csv")
In [108]:
#write.csv(unique_genes_eye_info_annot, file = "df_Vtsujii_sigfig_upreg_unique_comEye.csv")
In [109]:
#write.csv(unique_genes_gut_info_annot, file = "df_Vtsujii_sigfig_upreg_unique_Gut.csv")

DGE - GO enrichment analyses for bioluminescent upper lip

The topGO package was used to perform GO enrichment analysis on significantly upregulated genes (i.e., uniquely expressed) in the bioluminescent upper lip. (Alexa Rahnenfuhrer, 2024). The figures representing the GO enrichments presented here were utilized for analysis, but were not included in the main text of the publication. The GO enrichments were visualized with GO-Figure! package for publication (Reijnders and Waterhouse, 2021). GO-Figure! reference https://github.com/lmesrop/BCN_publication/tree/main/Go-Figure!.

In [138]:
#import the trinotate go sheet from Trinotate output 
geneID2GO_bio <- readMappings(file ="Trinotate_go_lym.txt")
In [139]:
#significantly upregulated genes (i.e. expressed uniquely) in the bioluminescent upper lip 

head(unique_genes_bio_upper_lip_info)
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
NODE_10035_length_2011_cov_7.68814_g7072_i0 31.4974643.5491891.11161323.1856211.444434e-031.694083e-02
NODE_10049_length_2009_cov_1010.67_g4245_i1 144.9421193.7802060.80275684.7029332.564508e-066.123575e-05
NODE_10092_length_2003_cov_126.275_g7115_i0 87.9102264.2261310.90710644.6530013.271381e-067.653491e-05
NODE_10123_length_1999_cov_24.5818_g6951_i1 12.9709451.9215760.68238472.8150524.876929e-034.595970e-02
NODE_10147_length_1995_cov_11.3794_g7152_i0 7.9424923.9612351.16034613.3728717.438888e-049.602505e-03
NODE_102011_length_137_cov_2.19512_g94791_i0 83.5905496.0765691.69826373.9424798.064363e-051.377157e-03
In [140]:
#save the transcript ids of all the annotated genes under geneNames object 
geneNames_bio<- as.character(Trinotate_lym_subset$transcript_id)
In [143]:
#save the transcript 
myInterestingGenes_bio= as.character(unique_genes_bio_upper_lip_info$transcript_id)
In [144]:
#subset the genesNames by the transcript IDs in my red module 
geneList_bio <- factor(as.integer(geneNames_bio %in% myInterestingGenes_bio))
names(geneList_bio) <- geneNames_bio
head(geneList_bio)
NODE_1_length_23329_cov_11.1744_g0_i0
0
NODE_2_length_22275_cov_6.9901_g1_i0
0
NODE_3_length_16360_cov_11.9819_g2_i0
0
NODE_4_length_15645_cov_338.265_g3_i0
0
NODE_5_length_15299_cov_16.4297_g0_i1
0
NODE_6_length_14792_cov_9.32463_g4_i0
0
Levels:
  1. '0'
  2. '1'

Use topGO to identify enriched biological processes in the bioluminescent upper lip

In [145]:
#run the topGO function.
GOdata_bio <- new("topGOdata", ontology = "BP", allGenes = geneList_bio,
                    annot = annFUN.gene2GO, gene2GO = geneID2GO_bio)
Building most specific GOs .....

	( 12645 GO terms found. )


Build GO DAG topology ..........

	( 13427 GO terms and 31130 relations. )


Annotating nodes ...............

	( 15506 genes annotated to the GO terms. )

In [146]:
results_go_bio <- runTest(GOdata_bio, algorithm="weight01", statistic="Fisher")
			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 1881 nontrivial nodes
		 parameters: 
			 test statistic: fisher


	 Level 16:	1 nodes to be scored	(0 eliminated genes)


	 Level 15:	4 nodes to be scored	(0 eliminated genes)


	 Level 14:	8 nodes to be scored	(6 eliminated genes)


	 Level 13:	25 nodes to be scored	(115 eliminated genes)


	 Level 12:	56 nodes to be scored	(265 eliminated genes)


	 Level 11:	94 nodes to be scored	(1648 eliminated genes)


	 Level 10:	139 nodes to be scored	(3396 eliminated genes)


	 Level 9:	203 nodes to be scored	(4708 eliminated genes)


	 Level 8:	243 nodes to be scored	(5830 eliminated genes)


	 Level 7:	302 nodes to be scored	(8018 eliminated genes)


	 Level 6:	315 nodes to be scored	(11122 eliminated genes)


	 Level 5:	254 nodes to be scored	(12818 eliminated genes)


	 Level 4:	149 nodes to be scored	(14217 eliminated genes)


	 Level 3:	70 nodes to be scored	(14877 eliminated genes)


	 Level 2:	17 nodes to be scored	(15171 eliminated genes)


	 Level 1:	1 nodes to be scored	(15489 eliminated genes)

In [147]:
#retrieve the GO enrichment 
goEnrichment_bio   <- GenTable(GOdata_bio, Fisher = results_go_bio, orderBy = "Fisher", topNodes =100, numChar =1000 )
In [148]:
goEnrichment_bio$Fisher <- as.numeric(goEnrichment_bio$Fisher)
goEnrichment_bio <- goEnrichment_bio[goEnrichment_bio$Fisher < 0.05,] 
goEnrichment_bio <- goEnrichment_bio[goEnrichment_bio$Significant > 1,]
In [106]:
#write.table(goEnrichment_bio, "df_TopGO_Vargula_tsujii_DE_unique_BioUpperLip_BP.tsv",sep = "\t", quote=FALSE)
In [36]:
myterms_bio =goEnrichment_bio$GO.ID 
mygenes_bio = genesInTerm(GOdata_bio, myterms_bio)
In [ ]:
#extract the transcript ids for each GO term
for (i in 1:length(myterms_bio))
{
   myterm_bio <- myterms_bio[i]
   mygenesforterm_bio <- mygenes_bio[myterm_bio][[1]]
   myfactor_bio <- mygenesforterm_bio %in% myInterestingGenes_bio 
   mygenesforterm2_bio <- mygenesforterm_bio[myfactor_bio == TRUE]
   mygenesforterm2_bio <- paste(mygenesforterm2_bio, collapse=',')
   print(paste("Term",myterm_bio,"genes:",mygenesforterm2_bio))
}
In [149]:
ntop = 48
ggdata <- goEnrichment_bio[1:ntop,]
ggdata$Term <- factor(ggdata$Term, levels = rev(ggdata$Term)) 
plot_Bio_UL_BP <- ggplot(ggdata,
  aes(x = Term, y = -log10(Fisher), size = Significant, fill = -log10(Fisher))) +

 ylim(1,11) +
  geom_point(shape = 21) +
  scale_size(range = c(2.5,12.5)) +
  scale_fill_continuous(low = 'royalblue', high = 'red4') +

  labs(
    title = 'GO Analysis - Biological Process')+
 theme_bw(base_size = 24) +
labs(size= "Number of Genes")+
  theme(
    legend.position = 'right',
    legend.background = element_rect(),
    plot.title = element_text(angle = 0, size = 16, face = 'bold', vjust = 1),
    plot.subtitle = element_text(angle = 0, size = 10, face = 'bold', vjust = 1),
    plot.caption = element_text(angle = 0, size = 12, face = 'bold', vjust = 1),

   axis.text.x = element_blank(),
    axis.text.y = element_text(angle = 0, size = 13, face = 'bold', vjust = 0.5),
    #axis.title = element_text(size = 12, face = 'bold'),
    axis.title.x = element_blank(),
    #axis.title.y = element_text(size = 8, face = 'bold'),
    axis.line = element_line(colour = 'black'),

    #Legend
    legend.key = element_blank(), 
    legend.text = element_text(size = 14, face = "bold"), 
    title = element_text(size = 14, face = "bold")) +



  coord_flip()
#dev.off()
In [157]:
plot_Bio_UL_BP + labs(x = NULL)
In [156]:
options(repr.plot.width=10, repr.plot.height=8, repr.plot.res = 500)

Use topGO to identify enriched molecular functions in the bioluminescent upper lip

In [53]:
#run the topGO function.
GOdata_bio_MF <- new("topGOdata", ontology = "MF", allGenes = geneList_bio,
                    annot = annFUN.gene2GO, gene2GO = geneID2GO_bio)
Building most specific GOs .....

	( 3583 GO terms found. )


Build GO DAG topology ..........

	( 3618 GO terms and 4733 relations. )


Annotating nodes ...............

	( 16040 genes annotated to the GO terms. )

In [54]:
results_go_bio_MF <- runTest(GOdata_bio_MF, algorithm="weight01", statistic="Fisher")
			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 401 nontrivial nodes
		 parameters: 
			 test statistic: fisher


	 Level 12:	1 nodes to be scored	(0 eliminated genes)


	 Level 11:	2 nodes to be scored	(0 eliminated genes)


	 Level 10:	3 nodes to be scored	(7 eliminated genes)


	 Level 9:	11 nodes to be scored	(38 eliminated genes)


	 Level 8:	28 nodes to be scored	(58 eliminated genes)


	 Level 7:	54 nodes to be scored	(2776 eliminated genes)


	 Level 6:	79 nodes to be scored	(3524 eliminated genes)


	 Level 5:	92 nodes to be scored	(6256 eliminated genes)


	 Level 4:	88 nodes to be scored	(8602 eliminated genes)


	 Level 3:	33 nodes to be scored	(13147 eliminated genes)


	 Level 2:	9 nodes to be scored	(14316 eliminated genes)


	 Level 1:	1 nodes to be scored	(15865 eliminated genes)

In [55]:
#retrieve the GO enrichment 
goEnrichment_bio_MF   <- GenTable(GOdata_bio_MF, Fisher = results_go_bio_MF, orderBy = "Fisher", topNodes =100, numChar =1000 )
In [56]:
goEnrichment_bio_MF$Fisher <- as.numeric(goEnrichment_bio_MF$Fisher)
goEnrichment_bio_MF <- goEnrichment_bio_MF[goEnrichment_bio_MF$Fisher < 0.05,] 
goEnrichment_bio_MF <- goEnrichment_bio_MF[goEnrichment_bio_MF$Significant > 1,] 
goEnrichment_bio_MF <- goEnrichment_bio_MF[,c("GO.ID","Term", "Annotated","Significant", "Expected", "Fisher")]
In [57]:
goEnrichment_bio_MF
A data.frame: 29 × 6
GO.IDTermAnnotatedSignificantExpectedFisher
<chr><chr><int><int><dbl><dbl>
1GO:0004252serine-type endopeptidase activity 241132.280.0000005
2GO:0004222metalloendopeptidase activity 118 91.120.0000018
3GO:0004089carbonate dehydratase activity 26 50.250.0000040
4GO:0047712Cypridina-luciferin 2-monooxygenase activity 46 60.440.0000045
5GO:0004478methionine adenosyltransferase activity 7 30.070.0000280
6GO:0004598peptidylamidoglycolate lyase activity 20 40.190.0000330
7GO:0004504peptidylglycine monooxygenase activity 20 40.190.0000330
8GO:0008061chitin binding 119 71.130.0001400
9GO:0030343vitamin D3 25-hydroxylase activity 3 20.030.0002700
10GO:0004867serine-type endopeptidase inhibitor activity 95 60.900.0002800
11GO:0004181metallocarboxypeptidase activity 36 40.340.0003600
12GO:0016831carboxy-lyase activity 101 60.960.0004000
13GO:0005184neuropeptide hormone activity 5 30.050.0005200
14GO:0004574oligo-1,6-glucosidase activity 5 20.050.0008800
15GO:0004575sucrose alpha-glucosidase activity 5 20.050.0008800
16GO:0004656procollagen-proline 4-dioxygenase activity 28 30.270.0023000
17GO:0004064arylesterase activity 9 20.090.0030700
18GO:0005507copper ion binding 65 40.620.0033500
19GO:0046422violaxanthin de-epoxidase activity 10 20.090.0038200
20GO:0031418L-ascorbic acid binding 41 30.390.0068300
21GO:0005509calcium ion binding 640136.060.0078900
23GO:0080030methyl indole-3-acetate esterase activity 17 20.160.0110500
25GO:0015459potassium channel regulator activity 24 20.230.0214800
26GO:0016702oxidoreductase activity, acting on single donors with incorporation of molecular oxygen, incorporation of two atoms of oxygen 63 30.600.0219000
27GO:0015151alpha-glucoside transmembrane transporter activity 26 20.250.0249900
28GO:0015574trehalose transmembrane transporter activity 26 20.250.0249900
29GO:0018833DDT-dehydrochlorinase activity 26 20.250.0249900
34GO:0005506iron ion binding 265 62.510.0405900
35GO:0008395steroid hydroxylase activity 34 20.320.0410700
In [60]:
ntop = 29
ggdata <- goEnrichment_bio_MF[1:ntop,]
ggdata$Term <- factor(ggdata$Term, levels = rev(ggdata$Term)) 
plot_Bio_UL_MF <- ggplot(ggdata,
  aes(x = Term, y = -log10(Fisher), size = Significant, fill = -log10(Fisher))) +

 ylim(1,11) +
  geom_point(shape = 21) +
  scale_size(range = c(2.5,12.5)) +
  scale_fill_continuous(low = 'royalblue', high = 'red4') +

  labs(
    title = 'GO Analysis - Molecular Function')+
 theme_bw(base_size = 24) +
labs(size= "Number of Genes")+
  theme(
    legend.position = 'right',
    legend.background = element_rect(),
    plot.title = element_text(angle = 0, size = 16, face = 'bold', vjust = 1),
    plot.subtitle = element_text(angle = 0, size = 10, face = 'bold', vjust = 1),
    plot.caption = element_text(angle = 0, size = 12, face = 'bold', vjust = 1),

   axis.text.x = element_blank(),
    axis.text.y = element_text(angle = 0, size = 13, face = 'bold', vjust = 0.5),
    #axis.title = element_text(size = 12, face = 'bold'),
    axis.title.x = element_blank(),
    #axis.title.y = element_text(size = 8, face = 'bold'),
    axis.line = element_line(colour = 'black'),

    #Legend
    legend.key = element_blank(), 
    legend.text = element_text(size = 14, face = "bold"), 
    title = element_text(size = 14, face = "bold")) +



  coord_flip()
#dev.off()
In [70]:
plot_Bio_UL_MF + labs(x=NULL)
In [69]:
options(repr.plot.width=16, repr.plot.height=8, repr.plot.res = 500)

Supp Material for Non-Luminous Ostracod Skogsbergia sp.

Load data

Import Skogsbergia sp. gene expression matrix which consists of three tissue types - upper lip, compound eye and gut - with five biological replicates for each tissue. Generate the sample name sheet (meta sheet) for downstream DESeq2 analyses. Read in the annotation file for Skogsbergia sp. generated by Trinotate.

In [2]:
#read in gene expression matrix 
skogs_counts <- read.delim("skogs_fasta90_isoform_combined.tab", header = TRUE, sep = "\t", quote = "")
In [3]:
#fix the column names 
row.names(skogs_counts) <-skogs_counts$X
In [4]:
#remove the column X and extra X.1 
skogs_counts$X.1 <- NULL 
skogs_counts$X <- NULL
In [5]:
meta <- data.frame(row.names = colnames(skogs_counts))
In [6]:
head(meta)
A data.frame: 6 × 0
Sk.10A_fasta90_isoform.counts.tab
Sk.10B_fasta90_isoform.counts.tab
Sk.10C_fasta90_isoform.counts.tab
Sk.6A_fasta90_isoform.counts.tab
Sk.6B_fasta90_isoform.counts.tab
Sk.6C_fasta90_isoform..counts.tab
In [7]:
sample_name = c("Upper_lip", "Eye", "Gut", "Upper_lip", "Eye", "Gut", "Upper_lip", "Eye", "Gut", "Upper_lip", "Eye", "Gut","Upper_lip", "Eye", "Gut")
In [8]:
meta$sample_name <- sample_name
In [9]:
meta$names <- rownames(meta)
In [10]:
rownames(meta) <- NULL

QC for downstream analyses

In [13]:
#import count table, meta and sample_name into a DESeq2 object
dds_count_table <- DESeqDataSetFromMatrix(countData = skogs_counts, colData = meta, design = ~sample_name)
Warning message in DESeqDataSet(se, design = design, ignoreRank):
“some variables in design formula are characters, converting to factors”
In [14]:
nrow(counts(dds_count_table))
81092

Visualize transformed expression matrix with hierarchical clustering and PCA

In [16]:
# run DESeq2 function and normalization  
dds_raw_counts <- DESeq(dds_count_table, betaPrior = FALSE, parallel = TRUE) 
# Perform a variance-stabilizing transformation
vsd_raw_counts <- varianceStabilizingTransformation(dds_raw_counts)
estimating size factors

estimating dispersions

gene-wise dispersion estimates: 38 workers

mean-dispersion relationship

final dispersion estimates, fitting model and testing: 38 workers

In [17]:
sampleDists_raw_counts <- dist(t(assay(vsd_raw_counts)))
In [18]:
#plot the heatmap
sampleDists_raw_counts_Matrix <- as.matrix(sampleDists_raw_counts)
rownames(sampleDists_raw_counts_Matrix) <- paste(colData(dds_raw_counts)$sample_name) 
colnames(sampleDists_raw_counts_Matrix) <- colData(dds_raw_counts)$names
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDists_raw_counts_Matrix,
          clustering_distance_rows=sampleDists_raw_counts,
         clustering_distance_cols=sampleDists_raw_counts,
         col=colors)
In [22]:
#plot the PCA
pcaData_raw_counts <- plotPCA(vsd_raw_counts, intgroup="sample_name", returnData=TRUE)
percentVar_raw_counts <- round(100 * attr(pcaData_raw_counts, "percentVar"))
ggplot(pcaData_raw_counts, aes(PC1, PC2, color=sample_name, shape=sample_name)) +
  geom_point(size=5) +
  xlab(paste0("PC1: ",percentVar_raw_counts[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar_raw_counts[2],"% variance")) + 
   geom_point(size=5) +
theme(panel.grid.major = element_line(colour = "gray97",  size = 1), panel.grid.minor = element_line(linetype = "dotted"), panel.background = element_rect(fill = NA), 
   legend.key = element_rect(fill = "gray100")) + theme(axis.line = element_line(size = 0.5,linetype = "solid")) + 
  theme(panel.border = element_rect(colour = "black", fill=NA, size=1)) +
coord_fixed() +
scale_color_manual(values = c('#A5D6A7','#CE93D8', '#81D4FA'))+theme(
    legend.title = element_text(size = 16),
    legend.text = element_text(size = 14),  
    axis.title.x = element_text(size = 16),  
    axis.title.y = element_text(size = 16),  
    axis.text = element_text(size = 12)  
  )
In [21]:
options(repr.plot.width=8, repr.plot.height=6, repr.plot.res = 150)

Import the annotation for Skogsbergia sp. transcriptome

In [23]:
#read in the Trinotate sheet for Skogsbergia sp. 

Trinotate_lym_subset_skogs <- read.csv(file = "Trinotate_lym_subset_Skogsbergia_cdhit90_longestisoform.csv")
In [24]:
Trinotate_lym_subset_skogs$X <- NULL
In [25]:
colnames(Trinotate_lym_subset_skogs)[1]<- "gene_id"

Differential gene expression

Differential gene expression analysis was carried out in DESeq2 (Love et al., 2014). For Skogsbergia sp. , we determined differentially upregulated genes in three tissue types - upper lip, compound eye and gut - using five biological replicates for each tissue. DESeq2 was employed using a p-value < 0.05 and FC > 1.5 for the significance of differentially expressed genes using the Benjamini-Hochberg method to account for false discovery rate (FDR). Pairwise comparisons were done across tissue types (i.e., upper lip to compound eye, upper lip to gut, gut to compound eye). To determine tissue-specific differential upregulation, each tissue was compared to the other two (e.g., genes upregulated uniquely in the upper lip were determined by comparing upper lip expression to both compound eye and gut). For each pairwise comparison, the reference tissue was specified to determine the significantly upregulated genes in each tissue type (i.e., positive vs negative log2fold change). For example,in the upper lip vs eye comparison, if a gene has a positive log2fold change, then that gene is upregulated in the upper lip compared to the eye. Conversely, if a gene has a negative log2fold change then the gene is upregulated in eye compared to the upper lip. Genes that were upregulated uniquely in each tissue type were used for all downstream analyses.

In [26]:
dds_count_table
class: DESeqDataSet 
dim: 81092 15 
metadata(1): version
assays(1): counts
rownames(81092): TRINITY_DN0_c0_g1_i2 TRINITY_DN0_c0_g3_i1 ...
  TRINITY_DN99_c4_g1_i1 TRINITY_DN9_c0_g1_i1
rowData names(0):
colnames(15): Sk.10A_fasta90_isoform.counts.tab
  Sk.10B_fasta90_isoform.counts.tab ...
  Sk.9B_fasta90_isoform..counts.tab Sk.9C_fasta90_isoform.counts.tab
colData names(2): sample_name names
In [27]:
#run DESeq2 analysis
dds_DE <- DESeq(dds_count_table)
estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

DGE - Upper Lip vs Compound Eye

In [28]:
#set Eyes as a reference tissue 
dds_DE$sample_name <- relevel(dds_DE$sample_name, ref= "Eye")
In [29]:
#rerun DESeq command after reference is specified 
dds_DE <- DESeq(dds_DE)
using pre-existing size factors

estimating dispersions

found already estimated dispersions, replacing these

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

In [30]:
#define contrasts, extract results table, and shrink the log2 fold changes

res_tableOE_unshrunken_UpperlipVs_Eye <- results(dds_DE, contrast= c("sample_name", "Upper_lip", "Eye") , alpha = 0.05)


res_tableOE_UpperlipVs_Eye <- lfcShrink(dds_DE, contrast= c("sample_name", "Upper_lip", "Eye"), res = res_tableOE_unshrunken_UpperlipVs_Eye)
using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895

In [31]:
mcols(res_tableOE_UpperlipVs_Eye, use.names=T)
DataFrame with 6 rows and 2 columns
                       type
                <character>
baseMean       intermediate
log2FoldChange      results
lfcSE               results
stat                results
pvalue              results
padj                results
                                                        description
                                                        <character>
baseMean                  mean of normalized counts for all samples
log2FoldChange log2 fold change (MAP): sample_name Upper_lip vs Eye
lfcSE                  standard error: sample_name Upper_lip vs Eye
stat                   Wald statistic: sample name Upper lip vs Eye
pvalue              Wald test p-value: sample name Upper lip vs Eye
padj                                           BH adjusted p-values
In [32]:
#set thresholds
#lfc.cutoff value of 0.58 translates to a 1.5-fold change in expression
#padj.cutoff value of 0.05 

padj.cutoff <- 0.05
lfc.cutoff <- 0.58
In [35]:
#convert to tibble
res_tableOE_UpperlipVs_Eye_tb <- res_tableOE_UpperlipVs_Eye %>% data.frame() %>% rownames_to_column(var="gene") %>% as_tibble()
In [37]:
#to determine all differentially expressed genes that meet the cutoff values (i.e., padj.cutoff of 0.05 and lfc.cutoff of 0.58, both upregulated and downregulated)
sigOE_UpperlipVs_Eye <- res_tableOE_UpperlipVs_Eye_tb %>% filter(padj < padj.cutoff & abs(log2FoldChange) > lfc.cutoff)
In [38]:
colnames(sigOE_UpperlipVs_Eye)[1]<- "transcript_id"
In [39]:
#extract all genes that are significantly upregulated in the Upper Lip (positive log2 fold change)
Upperlip_Vs_eye_sigOE_UPREGULATED_logfold <- sigOE_UpperlipVs_Eye %>% filter(padj < padj.cutoff & log2FoldChange > lfc.cutoff)
In [40]:
#add the annotation  
Upperlip_Vs_eye_sigOE_UPREGULATED_logfold_annot <- setDT(Trinotate_lym_subset_skogs, key = 'transcript_id')[J(Upperlip_Vs_eye_sigOE_UPREGULATED_logfold)]
In [41]:
#extract all genes that are significantly upregulated in the Compound Eye (negative log2fold change) compared to the Upper Lip.

Upperlip_Vs_eye_sigOE_DOWNREGULATED_logfold <- sigOE_UpperlipVs_Eye %>% filter(padj < padj.cutoff & log2FoldChange < -lfc.cutoff)
In [42]:
#save these two dataframes for downstream analysis in Section 4.4 

#genes that are significantly upregulated in the Upper Lip (positive log2 fold change)
head(Upperlip_Vs_eye_sigOE_UPREGULATED_logfold)

#genes that significantly upregulated in the Compound Eye (negative log2fold change) compared to the Upper Lip.
head(Upperlip_Vs_eye_sigOE_DOWNREGULATED_logfold)
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN100127_c0_g1_i1 3.6082284.4801291.09977474.0134245.984419e-050.007626394
TRINITY_DN101713_c0_g1_i1 9.1933903.7067620.89531103.9970826.412797e-050.008066841
TRINITY_DN10190_c0_g1_i1 3.5963753.4047441.00775763.4064406.581610e-040.042200953
TRINITY_DN101_c0_g1_i6 2916.0398453.0001340.82034653.6194632.952156e-040.023914234
TRINITY_DN101_c0_g4_i1 389.6503512.9992870.69841274.2696121.958133e-050.003016171
TRINITY_DN10265_c0_g2_i1 3.8833604.4428630.97654214.5370305.705196e-060.001111423
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN10153_c1_g1_i1 47.303025-1.9283670.5315456-3.6184802.963379e-042.391423e-02
TRINITY_DN1015_c0_g1_i4 23.418584-6.4019831.4252019-3.8506511.178042e-041.255415e-02
TRINITY_DN10186_c1_g1_i1 4.266713-4.4891301.2628190-3.4769595.071345e-043.549361e-02
TRINITY_DN10299_c1_g1_i1 8.938083-3.5469560.8856172-3.8555511.154693e-041.234835e-02
TRINITY_DN103043_c0_g1_i173.858175-6.8231590.7991789-8.1189364.702892e-168.461056e-13
TRINITY_DN1040_c0_g1_i5 61.368383-5.2458970.6502786-7.9434741.965959e-153.340493e-12

DGE - Gut vs Compound Eye

In [43]:
#define contrasts, extract results table, and shrink the log2 fold changes

res_tableOE_unshrunken_GutVsEye <- results(dds_DE, contrast= c("sample_name", "Gut", "Eye") , alpha = 0.05)


res_tableOE_GutVsEye <- lfcShrink(dds_DE, contrast= c("sample_name", "Gut", "Eye"), res = res_tableOE_unshrunken_GutVsEye)
using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895

In [45]:
mcols(res_tableOE_GutVsEye, use.names=T)
DataFrame with 6 rows and 2 columns
                       type                                    description
                <character>                                    <character>
baseMean       intermediate      mean of normalized counts for all samples
log2FoldChange      results log2 fold change (MAP): sample_name Gut vs Eye
lfcSE               results         standard error: sample_name Gut vs Eye
stat                results         Wald statistic: sample name Gut vs Eye
pvalue              results      Wald test p-value: sample name Gut vs Eye
padj                results                           BH adjusted p-values
In [46]:
#convert to tibble
res_tableOE_tb_GutVsEye <- res_tableOE_GutVsEye %>% data.frame() %>% rownames_to_column(var="gene") %>% as_tibble()
In [47]:
#determine all differentially expressed genes that meet the cutoff values (i.e., padj.cutoff of 0.05 and lfc.cutoff of 0.58, both upregulated and downregulated)
sigOE_GutVsEye <- res_tableOE_tb_GutVsEye %>% filter(padj < padj.cutoff & abs(log2FoldChange) > lfc.cutoff)
In [48]:
colnames(sigOE_GutVsEye)[1]<- "transcript_id"
In [49]:
#extract all genes that are significantly upregulated in the Gut (positive log2fold change)
GutVsEye_sigOE_UPREGULATED_logfold <- sigOE_GutVsEye %>% filter(padj < padj.cutoff & log2FoldChange > lfc.cutoff)
In [50]:
#add the annotation
GutVsEye_sigOE_UPREGULATED_logfold_annot <- setDT(Trinotate_lym_subset_skogs, key = 'transcript_id')[J(GutVsEye_sigOE_UPREGULATED_logfold)]
In [51]:
#extract all genes that are significantly upregulated in the Compound Eye (negative log2fold change) compared to the Gut.  
GutVsEye_sigOE_DOWNREGULATED_logfold <- sigOE_GutVsEye %>% filter(padj < padj.cutoff & log2FoldChange < -lfc.cutoff)
In [52]:
#save these two dataframes for downstream analysis

#genes that are significantly upregulated in the Gut (positive log2 fold change)
head(GutVsEye_sigOE_UPREGULATED_logfold)

#genes that are significantly upregulated in the Compound Eye (negative log2fold change) but that are downregulated in the gut 
head(GutVsEye_sigOE_DOWNREGULATED_logfold)
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN0_c0_g1_i2 87.7281344.1161020.75983925.3798697.454017e-084.057540e-06
TRINITY_DN0_c0_g4_i2 223.7656622.7134500.82052883.2924729.931093e-041.222380e-02
TRINITY_DN10004_c4_g1_i1 1.7990573.4721971.04587783.2993939.689408e-041.200249e-02
TRINITY_DN10008_c0_g1_i1 8.6055744.9269321.25191273.9178178.935435e-051.798068e-03
TRINITY_DN100098_c0_g1_i1 17.8356827.0633790.92348007.3637761.787794e-133.867227e-11
TRINITY_DN1000_c1_g1_i2 134.1704101.3700240.38040163.6002183.179503e-044.979644e-03
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN10002_c1_g1_i1 2.520034-3.3738721.1281647-2.8969793.767749e-033.257075e-02
TRINITY_DN10016_c3_g1_i2 46.654265-2.0215070.5571204-3.6281272.854847e-044.588174e-03
TRINITY_DN10036_c0_g1_i1 12.458310-2.0761700.6114870-3.3909286.965633e-049.174321e-03
TRINITY_DN10044_c1_g1_i1 2.893933-3.5092211.0126332-3.3160959.128492e-041.143451e-02
TRINITY_DN1004_c0_g1_i5 793.893694-3.9757480.7871186-5.0736443.902699e-071.732798e-05
TRINITY_DN10062_c0_g1_i1 9.145714-3.7621791.0732028-3.4712065.181254e-047.298007e-03

DGE - Upper Lip vs Gut

In [53]:
#now set gut as a reference tissue 
dds_DE$sample_name <- relevel(dds_DE$sample_name, ref= "Gut")
In [54]:
#rerun DESeq 
dds_DE <- DESeq(dds_DE)
using pre-existing size factors

estimating dispersions

found already estimated dispersions, replacing these

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

In [55]:
#define contrasts, extract results table, and shrink the log2 fold changes

res_tableOE_unshrunken_UpperLipVsGut <- results(dds_DE, contrast= c("sample_name", "Upper_lip", "Gut") , alpha = 0.05)


res_tableOE_UpperLipVsGut <- lfcShrink(dds_DE, contrast= c("sample_name", "Upper_lip", "Gut"), res = res_tableOE_unshrunken_UpperLipVsGut)
using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895

In [56]:
mcols(res_tableOE_UpperLipVsGut, use.names=T)
DataFrame with 6 rows and 2 columns
                       type
                <character>
baseMean       intermediate
log2FoldChange      results
lfcSE               results
stat                results
pvalue              results
padj                results
                                                        description
                                                        <character>
baseMean                  mean of normalized counts for all samples
log2FoldChange log2 fold change (MAP): sample_name Upper_lip vs Gut
lfcSE                  standard error: sample_name Upper_lip vs Gut
stat                   Wald statistic: sample name Upper lip vs Gut
pvalue              Wald test p-value: sample name Upper lip vs Gut
padj                                           BH adjusted p-values
In [ ]:
# convert to tibble
res_tableOE_tb_UpperLipVsGut <- res_tableOE_UpperLipVsGut %>%  data.frame() %>% rownames_to_column(var="gene") %>%  as_tibble()
In [58]:
#to determine all differentially expressed genes that meet the cutoff values (i.e., padj.cutoff of 0.05 and lfc.cutoff of 0.58, both upregulated and downregulated)
sigOE_UpperLipVsGut <- res_tableOE_tb_UpperLipVsGut %>% filter(padj < padj.cutoff & abs(log2FoldChange) > lfc.cutoff)
In [59]:
#extract all genes that are significantly upregulated in the upper lip (positive log2 fold change)
UpperLipVsGut_sigOE_UPREGULATED_logfold <- sigOE_UpperLipVsGut %>% filter(padj < padj.cutoff & log2FoldChange > lfc.cutoff)
In [60]:
#genes that are significantly upregulated in the upper lip (positive log2 fold change)
head(UpperLipVsGut_sigOE_UPREGULATED_logfold )
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN100127_c0_g1_i1 3.6082282.7293120.93417222.8803013.972962e-033.335116e-02
TRINITY_DN10016_c3_g1_i2 46.6542652.2776620.55515944.1020074.095821e-058.875274e-04
TRINITY_DN10031_c3_g1_i1 16.5440371.8336040.36990024.9545387.250222e-073.064050e-05
TRINITY_DN100493_c0_g1_i1 4.2492533.5714581.14549133.0469272.311939e-032.223371e-02
TRINITY_DN1004_c0_g1_i5 793.8936945.1949430.78684776.6143983.730678e-115.611549e-09
TRINITY_DN10062_c0_g1_i1 9.1457143.5319211.07377703.2682081.082309e-031.250321e-02
In [61]:
#extract all genes that are significantly upregulated in the Gut (negative log2fold change) compared to the upper lip
UpperLipVsGut_sigOE_DOWNREGULATED_logfold <- sigOE_UpperLipVsGut %>%
        filter(padj < padj.cutoff & log2FoldChange < -lfc.cutoff)
In [62]:
#extract all genes that are significantly upregulated in the Gut (negative log2fold change) compared to the Upper Lip.  
head(UpperLipVsGut_sigOE_DOWNREGULATED_logfold)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN0_c0_g1_i2 87.72813-4.7596780.7690108-6.1373138.392890e-108.887781e-08
TRINITY_DN0_c0_g3_i1 20.64123-2.2956630.8282384-2.7584285.808017e-034.411316e-02
TRINITY_DN0_c0_g4_i2 223.76566-3.9434880.8241288-4.7748251.798639e-066.543283e-05
TRINITY_DN100098_c0_g1_i1 17.83568-6.2745310.8954141-6.4896948.601097e-111.178318e-08
TRINITY_DN1000_c1_g1_i2 134.17041-1.5204510.3803992-3.9958546.446143e-051.293863e-03
TRINITY_DN1000_c2_g1_i2 13.27825-1.8062580.5411564-3.3247858.848688e-041.073042e-02
In [63]:
#save these two dataframes for downstream analysis

#genes that are significantly upregulated in the Upper Lip (positive log2fold change)
head(UpperLipVsGut_sigOE_UPREGULATED_logfold)

#genes that are significantly upregulated in the Gut (negative log2fold change) compared to the Upper Lip  
head(UpperLipVsGut_sigOE_DOWNREGULATED_logfold)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN100127_c0_g1_i1 3.6082282.7293120.93417222.8803013.972962e-033.335116e-02
TRINITY_DN10016_c3_g1_i2 46.6542652.2776620.55515944.1020074.095821e-058.875274e-04
TRINITY_DN10031_c3_g1_i1 16.5440371.8336040.36990024.9545387.250222e-073.064050e-05
TRINITY_DN100493_c0_g1_i1 4.2492533.5714581.14549133.0469272.311939e-032.223371e-02
TRINITY_DN1004_c0_g1_i5 793.8936945.1949430.78684776.6143983.730678e-115.611549e-09
TRINITY_DN10062_c0_g1_i1 9.1457143.5319211.07377703.2682081.082309e-031.250321e-02
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN0_c0_g1_i2 87.72813-4.7596780.7690108-6.1373138.392890e-108.887781e-08
TRINITY_DN0_c0_g3_i1 20.64123-2.2956630.8282384-2.7584285.808017e-034.411316e-02
TRINITY_DN0_c0_g4_i2 223.76566-3.9434880.8241288-4.7748251.798639e-066.543283e-05
TRINITY_DN100098_c0_g1_i1 17.83568-6.2745310.8954141-6.4896948.601097e-111.178318e-08
TRINITY_DN1000_c1_g1_i2 134.17041-1.5204510.3803992-3.9958546.446143e-051.293863e-03
TRINITY_DN1000_c2_g1_i2 13.27825-1.8062580.5411564-3.3247858.848688e-041.073042e-02

Determine tissue-specific expression

To identify tissue-specific differential expression (i.e., significantly upregulated genes that are uniquely expressed), each tissue was compared to the other two and tissue-specific genes were extracted from the intersection of the Venn diagram.Genes that were upregulated uniquely in each tissue type were used for all downstream analyses in the publication.

Upper lip

In [64]:
#combine genes upregulated in upper lips (Upper Lip vs Gut and Upper Lip vs Eye) from both pairwise comparisons into a single dataframe

head(UpperLipVsGut_sigOE_UPREGULATED_logfold)

head(Upperlip_Vs_eye_sigOE_UPREGULATED_logfold)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN100127_c0_g1_i1 3.6082282.7293120.93417222.8803013.972962e-033.335116e-02
TRINITY_DN10016_c3_g1_i2 46.6542652.2776620.55515944.1020074.095821e-058.875274e-04
TRINITY_DN10031_c3_g1_i1 16.5440371.8336040.36990024.9545387.250222e-073.064050e-05
TRINITY_DN100493_c0_g1_i1 4.2492533.5714581.14549133.0469272.311939e-032.223371e-02
TRINITY_DN1004_c0_g1_i5 793.8936945.1949430.78684776.6143983.730678e-115.611549e-09
TRINITY_DN10062_c0_g1_i1 9.1457143.5319211.07377703.2682081.082309e-031.250321e-02
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN100127_c0_g1_i1 3.6082284.4801291.09977474.0134245.984419e-050.007626394
TRINITY_DN101713_c0_g1_i1 9.1933903.7067620.89531103.9970826.412797e-050.008066841
TRINITY_DN10190_c0_g1_i1 3.5963753.4047441.00775763.4064406.581610e-040.042200953
TRINITY_DN101_c0_g1_i6 2916.0398453.0001340.82034653.6194632.952156e-040.023914234
TRINITY_DN101_c0_g4_i1 389.6503512.9992870.69841274.2696121.958133e-050.003016171
TRINITY_DN10265_c0_g2_i1 3.8833604.4428630.97654214.5370305.705196e-060.001111423
In [67]:
colnames(UpperLipVsGut_sigOE_UPREGULATED_logfold)[1]<- "gene"
In [68]:
colnames(Upperlip_Vs_eye_sigOE_UPREGULATED_logfold)[1]<- "gene"
In [69]:
#merge
merged_upper_lips_df <- rbind(UpperLipVsGut_sigOE_UPREGULATED_logfold, Upperlip_Vs_eye_sigOE_UPREGULATED_logfold)
In [70]:
#remove gene duplicates while retaining one duplicate due to the pairwise comparisons (Upper Lip vs Gut and Upper Lip vs Eye) 
merged_upper_lips_unique <- merged_upper_lips_df[!duplicated(merged_upper_lips_df$gene), ]
In [71]:
head(merged_upper_lips_unique)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN100127_c0_g1_i1 3.6082282.7293120.93417222.8803013.972962e-033.335116e-02
TRINITY_DN10016_c3_g1_i2 46.6542652.2776620.55515944.1020074.095821e-058.875274e-04
TRINITY_DN10031_c3_g1_i1 16.5440371.8336040.36990024.9545387.250222e-073.064050e-05
TRINITY_DN100493_c0_g1_i1 4.2492533.5714581.14549133.0469272.311939e-032.223371e-02
TRINITY_DN1004_c0_g1_i5 793.8936945.1949430.78684776.6143983.730678e-115.611549e-09
TRINITY_DN10062_c0_g1_i1 9.1457143.5319211.07377703.2682081.082309e-031.250321e-02
In [125]:
#write.csv(merged_upper_lips_unique, file = "SKOGS_merged_upper_lips_ALL_DE_genes.csv")

Compound eye

In [72]:
#combine genes upregulated in compound eye (Upper Lip vs Compound Eye and Gut vs Compound Eye) from both pairwise comparisons into a single dataframe

#genes that significantly upregulated in the Compound Eye
head(Upperlip_Vs_eye_sigOE_DOWNREGULATED_logfold)

#genes that are significantly upregulated in the Compound Eye
head(GutVsEye_sigOE_DOWNREGULATED_logfold)
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN10153_c1_g1_i1 47.303025-1.9283670.5315456-3.6184802.963379e-042.391423e-02
TRINITY_DN1015_c0_g1_i4 23.418584-6.4019831.4252019-3.8506511.178042e-041.255415e-02
TRINITY_DN10186_c1_g1_i1 4.266713-4.4891301.2628190-3.4769595.071345e-043.549361e-02
TRINITY_DN10299_c1_g1_i1 8.938083-3.5469560.8856172-3.8555511.154693e-041.234835e-02
TRINITY_DN103043_c0_g1_i173.858175-6.8231590.7991789-8.1189364.702892e-168.461056e-13
TRINITY_DN1040_c0_g1_i5 61.368383-5.2458970.6502786-7.9434741.965959e-153.340493e-12
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN10002_c1_g1_i1 2.520034-3.3738721.1281647-2.8969793.767749e-033.257075e-02
TRINITY_DN10016_c3_g1_i2 46.654265-2.0215070.5571204-3.6281272.854847e-044.588174e-03
TRINITY_DN10036_c0_g1_i1 12.458310-2.0761700.6114870-3.3909286.965633e-049.174321e-03
TRINITY_DN10044_c1_g1_i1 2.893933-3.5092211.0126332-3.3160959.128492e-041.143451e-02
TRINITY_DN1004_c0_g1_i5 793.893694-3.9757480.7871186-5.0736443.902699e-071.732798e-05
TRINITY_DN10062_c0_g1_i1 9.145714-3.7621791.0732028-3.4712065.181254e-047.298007e-03
In [73]:
colnames(Upperlip_Vs_eye_sigOE_DOWNREGULATED_logfold)[1]<- "gene"
In [74]:
colnames(GutVsEye_sigOE_DOWNREGULATED_logfold)[1]<- "gene"
In [75]:
#merged eye
merged_Eye_df <- rbind(Upperlip_Vs_eye_sigOE_DOWNREGULATED_logfold , GutVsEye_sigOE_DOWNREGULATED_logfold)
In [76]:
head(merged_Eye_df)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN10153_c1_g1_i1 47.303025-1.9283670.5315456-3.6184802.963379e-042.391423e-02
TRINITY_DN1015_c0_g1_i4 23.418584-6.4019831.4252019-3.8506511.178042e-041.255415e-02
TRINITY_DN10186_c1_g1_i1 4.266713-4.4891301.2628190-3.4769595.071345e-043.549361e-02
TRINITY_DN10299_c1_g1_i1 8.938083-3.5469560.8856172-3.8555511.154693e-041.234835e-02
TRINITY_DN103043_c0_g1_i173.858175-6.8231590.7991789-8.1189364.702892e-168.461056e-13
TRINITY_DN1040_c0_g1_i5 61.368383-5.2458970.6502786-7.9434741.965959e-153.340493e-12
In [77]:
#remove gene duplicates while retaining one duplicate due to the pairwise comparisons (Upper Lip vs Compound Eye and Gut vs Compound Eye). 
merged_Eye_unique <-merged_Eye_df[!duplicated(merged_Eye_df$gene), ]
In [78]:
head(merged_Eye_unique)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN10153_c1_g1_i1 47.303025-1.9283670.5315456-3.6184802.963379e-042.391423e-02
TRINITY_DN1015_c0_g1_i4 23.418584-6.4019831.4252019-3.8506511.178042e-041.255415e-02
TRINITY_DN10186_c1_g1_i1 4.266713-4.4891301.2628190-3.4769595.071345e-043.549361e-02
TRINITY_DN10299_c1_g1_i1 8.938083-3.5469560.8856172-3.8555511.154693e-041.234835e-02
TRINITY_DN103043_c0_g1_i173.858175-6.8231590.7991789-8.1189364.702892e-168.461056e-13
TRINITY_DN1040_c0_g1_i5 61.368383-5.2458970.6502786-7.9434741.965959e-153.340493e-12
In [126]:
#write.csv(merged_Eye_unique, file = "SKOGS_merged_Eye_ALL_DE_genes.csv")

Gut

In [79]:
#combine genes upregulated in Gut (Gut vs Eye and Upper lip vs Gut) from both pairwise comparisons into a single dataframe

#genes that are significantly upregulated in the Gut
head(GutVsEye_sigOE_UPREGULATED_logfold)

#genes that are significantly upregulated in the Gut 
head(UpperLipVsGut_sigOE_DOWNREGULATED_logfold)
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN0_c0_g1_i2 87.7281344.1161020.75983925.3798697.454017e-084.057540e-06
TRINITY_DN0_c0_g4_i2 223.7656622.7134500.82052883.2924729.931093e-041.222380e-02
TRINITY_DN10004_c4_g1_i1 1.7990573.4721971.04587783.2993939.689408e-041.200249e-02
TRINITY_DN10008_c0_g1_i1 8.6055744.9269321.25191273.9178178.935435e-051.798068e-03
TRINITY_DN100098_c0_g1_i1 17.8356827.0633790.92348007.3637761.787794e-133.867227e-11
TRINITY_DN1000_c1_g1_i2 134.1704101.3700240.38040163.6002183.179503e-044.979644e-03
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN0_c0_g1_i2 87.72813-4.7596780.7690108-6.1373138.392890e-108.887781e-08
TRINITY_DN0_c0_g3_i1 20.64123-2.2956630.8282384-2.7584285.808017e-034.411316e-02
TRINITY_DN0_c0_g4_i2 223.76566-3.9434880.8241288-4.7748251.798639e-066.543283e-05
TRINITY_DN100098_c0_g1_i1 17.83568-6.2745310.8954141-6.4896948.601097e-111.178318e-08
TRINITY_DN1000_c1_g1_i2 134.17041-1.5204510.3803992-3.9958546.446143e-051.293863e-03
TRINITY_DN1000_c2_g1_i2 13.27825-1.8062580.5411564-3.3247858.848688e-041.073042e-02
In [80]:
colnames(GutVsEye_sigOE_UPREGULATED_logfold)[1]<- "gene"
In [81]:
colnames(UpperLipVsGut_sigOE_DOWNREGULATED_logfold)[1]<- "gene"
In [82]:
#merge
merged_Gut_df <- rbind(GutVsEye_sigOE_UPREGULATED_logfold , UpperLipVsGut_sigOE_DOWNREGULATED_logfold)
In [83]:
head(merged_Gut_df)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN0_c0_g1_i2 87.7281344.1161020.75983925.3798697.454017e-084.057540e-06
TRINITY_DN0_c0_g4_i2 223.7656622.7134500.82052883.2924729.931093e-041.222380e-02
TRINITY_DN10004_c4_g1_i1 1.7990573.4721971.04587783.2993939.689408e-041.200249e-02
TRINITY_DN10008_c0_g1_i1 8.6055744.9269321.25191273.9178178.935435e-051.798068e-03
TRINITY_DN100098_c0_g1_i1 17.8356827.0633790.92348007.3637761.787794e-133.867227e-11
TRINITY_DN1000_c1_g1_i2 134.1704101.3700240.38040163.6002183.179503e-044.979644e-03
In [84]:
#remove gene duplicates while retaining one duplicate due to the pairwise comparisons (Gut vs Eye and Upper lip vs Gut)
merged_Gut_unique <-merged_Gut_df[!duplicated(merged_Gut_df$gene), ]
In [85]:
head(merged_Gut_unique)
A tibble: 6 × 7
genebaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN0_c0_g1_i2 87.7281344.1161020.75983925.3798697.454017e-084.057540e-06
TRINITY_DN0_c0_g4_i2 223.7656622.7134500.82052883.2924729.931093e-041.222380e-02
TRINITY_DN10004_c4_g1_i1 1.7990573.4721971.04587783.2993939.689408e-041.200249e-02
TRINITY_DN10008_c0_g1_i1 8.6055744.9269321.25191273.9178178.935435e-051.798068e-03
TRINITY_DN100098_c0_g1_i1 17.8356827.0633790.92348007.3637761.787794e-133.867227e-11
TRINITY_DN1000_c1_g1_i2 134.1704101.3700240.38040163.6002183.179503e-044.979644e-03
In [127]:
#write.csv(merged_Gut_unique, file = "SKOGS_merged_Gut_ALL_DE_genes.csv")

Venn Diagram - Extract tissue-specific genes

In [87]:
#generate a venn diagram to visualize shared significantly upregulated genes across tissue types and extract the genes that are unique to each tissue type. 
unique_venn_list <- list(
  Upper_Lip = merged_upper_lips_unique$gene  , 
  Gut = merged_Gut_unique$gene,
  Compound_Eye = merged_Eye_unique$gene
)

ggvenn_unique <- ggvenn(
  unique_venn_list, 
  fill_color = c('#81D4FA','#A5D6A7', '#CE93D8'),
  stroke_size = .7, set_name_size = 6, text_size = 5
)

ggvenn_unique
In [ ]:
# Open a PDF device
pdf("Non_Luminous_DGE.pdf", width = 8, height = 6)


ggvenn_unique 

dev.off()
In [88]:
#prep dataframes for extraction
Upper_Lip <- as.data.frame(merged_upper_lips_unique$gene)
colnames(Upper_Lip)[1]<- "gene"
Gut <- as.data.frame(merged_Gut_unique$gene)
colnames(Gut)[1]<- "gene"
Compound_eye <- as.data.frame(merged_Eye_unique$gene)
colnames(Compound_eye)[1]<- "gene"
In [89]:
# compare and extract unique genes for each tissue type
unique_genes_upper_lip <- anti_join(Upper_Lip, Gut, by = "gene") %>%
  anti_join(Compound_eye, by = "gene")

unique_genes_gut <- anti_join(Gut, Upper_Lip, by = "gene") %>%
  anti_join(Compound_eye, by = "gene")

unique_genes_compound_eye <- anti_join(Compound_eye, Upper_Lip, by = "gene") %>%
  anti_join(Gut, by = "gene")
In [90]:
nrow(unique_genes_upper_lip)
934
In [91]:
nrow(unique_genes_gut)
4266
In [92]:
nrow(unique_genes_compound_eye)
800
In [ ]:
#add the annotations back to the unique genes in each tissue type by subsetting
In [93]:
unique_genes_upper_lip_info  <- merged_upper_lips_unique %>%
  filter(gene %in% unique_genes_upper_lip$gene)
In [95]:
unique_genes_eye_info  <- merged_Eye_unique %>%
  filter(gene %in% unique_genes_compound_eye$gene)
In [97]:
unique_genes_gut_info  <- merged_Gut_unique %>%
  filter(gene %in% unique_genes_gut$gene)
In [99]:
#add annotations back for upper lip
colnames(unique_genes_upper_lip_info)[1]<- "transcript_id"
unique_genes_upper_lip_info_annot <- left_join(unique_genes_upper_lip_info,Trinotate_lym_subset_skogs,by="transcript_id")
In [100]:
#write.csv(unique_genes_upper_lip_info_annot, file = "df_Skogs_sigfig_upreg_unique_Upper_Lip.csv")
In [101]:
#add annotations back for eyes 
colnames(unique_genes_eye_info)[1]<- "transcript_id"
unique_genes_eye_info_annot <- left_join(unique_genes_eye_info,Trinotate_lym_subset_skogs,by="transcript_id")
In [102]:
#write.csv(unique_genes_eye_info_annot, file = "df_Skogs_sigfig_upreg_unique_comEye.csv")
In [103]:
#add annotations back for gut 
colnames(unique_genes_gut_info)[1] <- "transcript_id"
unique_genes_gut_info_annot <- left_join(unique_genes_gut_info,Trinotate_lym_subset_skogs,by="transcript_id")
In [ ]:
#write.csv(unique_genes_gut_info_annot, file = "df_Skogs_sigfig_upreg_unique_Gut.csv")

DGE - GO enrichment analyses for upper lip

Use topGO to identify enriched biological processes in the upper lip

In [104]:
#import the trinotate go sheet from Trinotate output
geneID2GO <- readMappings(file ="go_annotations_cdhit90_longestisoform.txt")
In [105]:
#save the transcript ids of all the annotated genes under geneNames object 
geneNames<- as.character(Trinotate_lym_subset_skogs$transcript_id)
In [106]:
#significantly upregulated genes (i.e. expressed uniquely) in the upper lip 
head(unique_genes_upper_lip_info)
A tibble: 6 × 7
transcript_idbaseMeanlog2FoldChangelfcSEstatpvaluepadj
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
TRINITY_DN100127_c0_g1_i1 3.6082282.7293120.93417222.8803013.972962e-033.335116e-02
TRINITY_DN10031_c3_g1_i1 16.5440371.8336040.36990024.9545387.250222e-073.064050e-05
TRINITY_DN100493_c0_g1_i1 4.2492533.5714581.14549133.0469272.311939e-032.223371e-02
TRINITY_DN10077_c0_g2_i5 22.1542941.2218580.44662482.7354706.229131e-034.641496e-02
TRINITY_DN10099_c0_g1_i3 28.6669732.9850020.91352073.2705001.073577e-031.242571e-02
TRINITY_DN10116_c0_g1_i1 17.0293545.5422071.03557575.1421492.716139e-071.331053e-05
In [107]:
#save the transcript 
myInterestingGenes= as.character(unique_genes_upper_lip_info$transcript_id)
In [108]:
#subset the genesNames by the transcript IDs in my red module 
geneList <- factor(as.integer(geneNames %in% myInterestingGenes))
names(geneList) <- geneNames
head(geneList)
TRINITY_DN0_c0_g1_i2
0
TRINITY_DN0_c0_g3_i1
0
TRINITY_DN0_c0_g4_i2
0
TRINITY_DN100000_c0_g1_i1
0
TRINITY_DN100001_c0_g1_i1
0
TRINITY_DN100002_c0_g1_i1
0
Levels:
  1. '0'
  2. '1'
In [109]:
#run the topGO function for BP 
GOdata <- new("topGOdata", ontology = "BP", allGenes = geneList,
                    annot = annFUN.gene2GO, gene2GO = geneID2GO)
Building most specific GOs .....

	( 15239 GO terms found. )


Build GO DAG topology ..........

	( 15990 GO terms and 37503 relations. )


Annotating nodes ...............

	( 25887 genes annotated to the GO terms. )

In [110]:
results_go <- runTest(GOdata, algorithm="weight01", statistic="Fisher")
			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 2274 nontrivial nodes
		 parameters: 
			 test statistic: fisher


	 Level 19:	1 nodes to be scored	(0 eliminated genes)


	 Level 18:	1 nodes to be scored	(0 eliminated genes)


	 Level 17:	1 nodes to be scored	(2 eliminated genes)


	 Level 16:	3 nodes to be scored	(13 eliminated genes)


	 Level 15:	12 nodes to be scored	(25 eliminated genes)


	 Level 14:	22 nodes to be scored	(133 eliminated genes)


	 Level 13:	41 nodes to be scored	(497 eliminated genes)


	 Level 12:	75 nodes to be scored	(1685 eliminated genes)


	 Level 11:	119 nodes to be scored	(4443 eliminated genes)


	 Level 10:	178 nodes to be scored	(7832 eliminated genes)


	 Level 9:	249 nodes to be scored	(10815 eliminated genes)


	 Level 8:	290 nodes to be scored	(13231 eliminated genes)


	 Level 7:	358 nodes to be scored	(15895 eliminated genes)


	 Level 6:	362 nodes to be scored	(20011 eliminated genes)


	 Level 5:	292 nodes to be scored	(22509 eliminated genes)


	 Level 4:	166 nodes to be scored	(24221 eliminated genes)


	 Level 3:	84 nodes to be scored	(25132 eliminated genes)


	 Level 2:	19 nodes to be scored	(25518 eliminated genes)


	 Level 1:	1 nodes to be scored	(25827 eliminated genes)

In [111]:
#retrieve the GO enrichment 
goEnrichment   <- GenTable(GOdata, Fisher = results_go, orderBy = "Fisher", topNodes = 10000, numChar=1000)
In [112]:
goEnrichment$Fisher <- as.numeric(goEnrichment$Fisher)
goEnrichment <- goEnrichment[goEnrichment$Fisher < 0.05,] 
goEnrichment <- goEnrichment[goEnrichment$Significant > 1,]
In [113]:
myterms =goEnrichment$GO.ID 
mygenes = genesInTerm(GOdata, myterms)
In [ ]:
#extract the transcript ids for each GO term
var=c()
for (i in 1:length(myterms))
{
   myterm <- myterms[i]
   mygenesforterm <- mygenes[myterm][[1]]
   myfactor <- mygenesforterm %in% myInterestingGenes
   mygenesforterm2 <- mygenesforterm[myfactor == TRUE]
   mygenesforterm2 <- paste(mygenesforterm2, collapse=',')
   var[i]=paste(myterm,"genes:",mygenesforterm2)
}
In [114]:
# GO enrichment Upper Lip - BP 

ntop = 50
ggdata <- goEnrichment[1:ntop,]
ggdata$Term <- factor(ggdata$Term, levels = rev(ggdata$Term)) 
plot_GO_UL_BP <- ggplot(ggdata,
  aes(x = Term, y = -log10(Fisher), size = Significant, fill = -log10(Fisher))) +

  expand_limits(y = 1) +
  geom_point(shape = 21) +
  scale_size(range = c(2,7)) +
  scale_fill_continuous(low = 'royalblue', high = 'red4') +

  labs(
    title = 'GO Analysis - Biological Process')+
   theme_bw(base_size = 24) +
labs(size= "Number of Genes")+
  theme(
    legend.position = 'right',
    legend.background = element_rect(),
    plot.title = element_text(angle = 0, size = 16, face = 'bold', vjust = 1),
    plot.subtitle = element_text(angle = 0, size = 10, face = 'bold', vjust = 1),
    plot.caption = element_text(angle = 0, size = 12, face = 'bold', vjust = 1),

   axis.text.x = element_blank(),
    axis.text.y = element_text(angle = 0, size = 13, face = 'bold', vjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_text(size = 8, face = 'bold'),
    axis.line = element_line(colour = 'black'),

    legend.key = element_blank(),
    legend.text = element_text(size = 14, face = "bold"), 
    title = element_text(size = 14, face = "bold")) +


  coord_flip()

#dev.off()
In [118]:
plot_GO_UL_BP + labs(x = NULL)
In [117]:
#library (repr)
options(repr.plot.width=12, repr.plot.height=8, repr.plot.res = 500)

Use topGO to identify enriched molecular functions in the upper lip

In [119]:
#run the topGO function 
GOdata_MF <- new("topGOdata", ontology = "MF", allGenes = geneList,
                    annot = annFUN.gene2GO, gene2GO = geneID2GO)
Building most specific GOs .....

	( 4594 GO terms found. )


Build GO DAG topology ..........

	( 4632 GO terms and 6032 relations. )


Annotating nodes ...............

	( 25806 genes annotated to the GO terms. )

In [120]:
results_go_MF <- runTest(GOdata_MF, algorithm="weight01", statistic="Fisher")
			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 397 nontrivial nodes
		 parameters: 
			 test statistic: fisher


	 Level 12:	3 nodes to be scored	(0 eliminated genes)


	 Level 11:	4 nodes to be scored	(0 eliminated genes)


	 Level 10:	5 nodes to be scored	(45 eliminated genes)


	 Level 9:	15 nodes to be scored	(260 eliminated genes)


	 Level 8:	23 nodes to be scored	(2465 eliminated genes)


	 Level 7:	57 nodes to be scored	(6712 eliminated genes)


	 Level 6:	71 nodes to be scored	(7530 eliminated genes)


	 Level 5:	88 nodes to be scored	(10807 eliminated genes)


	 Level 4:	81 nodes to be scored	(14917 eliminated genes)


	 Level 3:	39 nodes to be scored	(21292 eliminated genes)


	 Level 2:	10 nodes to be scored	(23253 eliminated genes)


	 Level 1:	1 nodes to be scored	(25443 eliminated genes)

In [121]:
#retrieve the GO enrichment 
goEnrichment_MF   <- GenTable(GOdata_MF, Fisher = results_go_MF, orderBy = "Fisher", topNodes = 100, numChar=1000)
In [122]:
#lets graph the GO enrichment
goEnrichment_MF$Fisher <- as.numeric(goEnrichment_MF$Fisher)
goEnrichment_MF <- goEnrichment_MF[goEnrichment_MF$Fisher < 0.05,] 
goEnrichment_MF <- goEnrichment_MF[goEnrichment_MF$Significant > 1,]
In [123]:
# GO enrichment Upper Lip - MF 

ntop = 22
ggdata <- goEnrichment_MF[1:ntop,]
ggdata$Term <- factor(ggdata$Term, levels = rev(ggdata$Term)) 
plot_GO_UL_MF <- ggplot(ggdata,
  aes(x = Term, y = -log10(Fisher), size = Significant, fill = -log10(Fisher))) +

  expand_limits(y = 1) +
  geom_point(shape = 21) +
  scale_size(range = c(2,7)) +
  scale_fill_continuous(low = 'royalblue', high = 'red4') +

  labs(
    title = 'GO Analysis - Molecular Function')+
   theme_bw(base_size = 24) +
labs(size= "Number of Genes")+
  theme(
    legend.position = 'right',
    legend.background = element_rect(),
    plot.title = element_text(angle = 0, size = 16, face = 'bold', vjust = 1),
    plot.subtitle = element_text(angle = 0, size = 10, face = 'bold', vjust = 1),
    plot.caption = element_text(angle = 0, size = 12, face = 'bold', vjust = 1),

   axis.text.x = element_blank(),
    axis.text.y = element_text(angle = 0, size = 13, face = 'bold', vjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_text(size = 8, face = 'bold'),
    axis.line = element_line(colour = 'black'),

    legend.key = element_blank(),
    legend.text = element_text(size = 14, face = "bold"), 
    title = element_text(size = 14, face = "bold")) +


  coord_flip()

#dev.off()
In [124]:
plot_GO_UL_MF +labs(x=NULL)
In [ ]:
#library (repr)
options(repr.plot.width=8, repr.plot.height=5)

References

P. Langfelder, S. Horvath, WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008)

M. I. Love, W. Huber, S. Anders, Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014)

Alexa A, Rahnenfuhrer J, topGO: Enrichment Analysis for Gene Ontology (2024)

Shannon, Paul et al., Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research vol. 13,11 (2003)